Sir:

This Appeal Brief is submitted in response to the final rejections of the claims dated 
August 17, 2006. A Notice of Appeal was filed on October 5, 2006. 
REAL PARTY IN INTEREST

The real party in interest is Hewlett-Packard Development Company, LP, a limited 
partnership established under the laws of the State of Texas and having a principal place of
business at 20555 S.H. 249 Houston, Texas 77070, USA (hereinafter "HPDC"). HPDC is a
wholly owned affiliate of Hewlett-Packard Company, a Delaware Corporation, headquartered 
in Palo Alto, CA. The general or managing partner of HPDC is HPQ Holdings, LLC, 
RELATED APPEALS AND INTERFERENCES

There are currently no related appeals or interferences known to Appellant, Appellant's 
legal representative, or the assignee which will directly affect, or be directly affected by, or have 
a bearing on, the Board's decision. 






STATUS OF CLAIMS 

Claims 1-34 are pending in the application. Claims 1-34 currently stand rejected. The 
rejections of claims 1-34 are appealed. 
STATUS OF AMENDMENTS

No amendments have been filed subsequent to the issuance of the final office action. 
SUMMARY OF CLAIMED SUBJECT MATTER

The invention as claimed is summarized below with reference to the independent claims 
and to the claims argued separately. The claims contain reference numerals and reference to the 
specification and drawings. All references are shown in the application at least where indicated 
herein. 

(Claim 1) A method (12, Fig. 2, p. 4, ll. 18-30; and p. 9, l. 16-p. 15, l. 9) for accessing
network data (14, Fig. 1, p. 4, ll. 18-30; p. 5, l. 28-p. 6, l. 8; p. 9, ll. 9-15; p. 10, ll. 5-10; p. 11,
l. 13-p. 13, l. 28; and p. 14, ll. 12-33) associated with a document (16, Fig. 1, p. 4, ll. 18-30; p.
6, l. 9-p. 7, l. 14; p. 10, ll. 17-22), comprising:

converting (44, Fig. 2, p. 9, ll. 26-31) at least a portion (17, Fig. 1, p. 4, ll. 18-30;
p. 6, l. 26-p. 7, l. 14; p. 9, ll. 26-31) of said document (16) to electronic format (16', Fig.
1, p. 4, ll. 18-30; p. 6, l. 26-p. 7, l. 14; p. 9, ll. 26-31; p. 10, ll. 5-22; p. 14, ll. 5-33) with
a digital capture input device (18, Fig. 1, p. 4, ll. 18-30; p. 6, l. 9-p. 8, l. 21), the at least
a portion (17) of said document (16) having one or more indicia (20, Fig. 1, p. 4, ll. 18-30;
p. 10, l. 11-p. 12, l. 6) thereon, the digital capture input device (18) being operatively
associated with a network (22, Fig. 1, p. 5, l. 28-p. 6, l. 8; p. 8, l. 3-p. 9, l. 8);

analyzing (52, Fig. 2, p. 10, l. 11-p. 11, l. 12; p. 13, ll. 22-28) the at least a portion
(17) of said document (16) in electronic format (16') to obtain said one or more indicia
(20);

using (54, Fig. 2, p. 11, l. 13-p. 12, l. 6) said one or more indicia (20) to locate
(48, Fig. 2, p. 10, ll. 2-5) said network data (14), said network data (14) not including said
document (16), said network data (14) being maintained at another device operatively
associated with the network (22); and

automatically accessing (56, Fig. 2, p. 9, ll. 4-8; p. 12, l. 7-p. 13, l. 11) said
network data (14).

(Claim 8) The method (12) of claim 1, wherein accessing (56) said network data (14)
comprises receiving said network data (14) at an email account (28, Fig. 1, p. 12, ll. 20-21).

(Claim 13) The method (12) of claim 1, wherein said one or more indicia (20) comprises one
or more words, and wherein using (54) said one or more indicia (20) to locate (48) said network
data (14) comprises:

determining a frequency for each of said one or more words (p. 11, l. 13-p. 12, l. 6);

comparing the frequencies of said one or more words with a word frequency list;
and

using the results of said frequency comparison to locate (48) said network data
(14).
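
For illustration only, the three steps recited in claim 13 might be sketched as follows. The sketch assumes that "comparing the frequencies ... with a word frequency list" means ranking each word by how much more often it appears in the captured portion than in a general reference list, and that the highest-ranked words are then used as search terms. The function names, the ratio-based ranking, and the sample reference list are hypothetical and are not taken from the specification.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Step 1: relative frequency of each word in the captured text."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def distinctive_words(text, reference_freq, top_n=3, floor=1e-6):
    """Step 2: compare document frequencies with a reference frequency list.

    Words that occur far more often in the document than in the reference
    list rank first. `floor` avoids division by zero for unlisted words.
    """
    doc_freq = word_frequencies(text)
    ratio = {
        word: freq / max(reference_freq.get(word, 0.0), floor)
        for word, freq in doc_freq.items()
    }
    ranked = sorted(ratio, key=lambda word: (-ratio[word], word))
    return ranked[:top_n]


def build_query(text, reference_freq, top_n=3):
    """Step 3: turn the comparison result into a search string.

    A real system would pass this string to whatever search facility is
    used to locate the network data; that part is outside this sketch.
    """
    return " ".join(distinctive_words(text, reference_freq, top_n))


# Hypothetical reference list: common words have high baseline frequencies.
REFERENCE = {"the": 0.05, "a": 0.03, "for": 0.01,
             "printer": 0.0001, "warranty": 0.0001}

print(build_query("The printer warranty for the printer", REFERENCE, top_n=2))
```

Under these assumptions, common words such as "the" are discounted by their high baseline frequency, so the resulting query is built from the words most characteristic of the captured portion.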

(Claim 15) Apparatus (10, Fig. 1) for accessing network data (14, Fig. 1, p. 4, ll. 18-30; p.
5, l. 28-p. 6, l. 8; p. 9, ll. 9-15; p. 10, ll. 5-10; p. 11, l. 13-p. 13, l. 28; and p. 14, ll. 12-33)
associated with a document (16, Fig. 1, p. 4, ll. 18-30; p. 6, l. 9-p. 7, l. 14; p. 10, ll. 17-22),
comprising:

one or more computer readable storage media; and

computer readable program code stored on said one or more computer readable
storage media, said computer readable program code comprising:

program code for analyzing (52, Fig. 2, p. 10, l. 11-p. 11, l. 12; p. 13, ll. 22-28)
at least a portion (17, Fig. 1, p. 4, ll. 18-30; p. 6, l. 26-p. 7, l. 14; p. 9, ll. 26-31) of said
document (16) to obtain one or more indicia (20, Fig. 1, p. 4, ll. 18-30; p. 10, l. 11-p. 12,
l. 6) from the at least a portion (17) of said document (16) after the at least a portion (17)
of said document (16) has been converted (44, Fig. 2, p. 9, ll. 26-31) to electronic format
(16', Fig. 1, p. 4, ll. 18-30; p. 6, l. 26-p. 7, l. 14; p. 9, ll. 26-31; p. 10, ll. 5-22; p. 14, ll.
5-33) with a digital capture input device (18, Fig. 1, p. 4, ll. 18-30; p. 6, l. 9-p. 8, l. 21)
operatively associated with a network (22, Fig. 1, p. 5, l. 28-p. 6, l. 8; p. 8, l. 3-p. 9, l. 8);

program code for using (54, Fig. 2, p. 11, l. 13-p. 12, l. 6) said one or more indicia
(20) to locate (48, Fig. 2, p. 10, ll. 2-5) said network data (14), said network data (14) not
including said document (16), said network data (14) being maintained at another device
operatively associated with the network (22); and

program code for automatically accessing (56, Fig. 2, p. 9, ll. 4-8; p. 12, l. 7-p. 13,
l. 11) said network data (14).

(Claim 23) The apparatus of claim 15, wherein said program code for accessing (56) said
network data (14) comprises program code for sending said network data (14) to an email
account (28, Fig. 1, p. 12, ll. 20-21).

(Claim 28) The apparatus of claim 15, wherein said one or more indicia (20) comprises one
or more words, and wherein said program code for using (54) said one or more indicia (20) to
locate (48) said network data (14) comprises:

program code for determining a frequency for each of said one or more words (p.
11, l. 13-p. 12, l. 6);

program code for comparing the frequencies of said one or more words with a
word frequency list; and

program code for using the results of said frequency comparison to locate (48)
said network data (14).

(Claim 29) A system for accessing network data (14, Fig. 1, p. 4, ll. 18-30; p. 5, l. 28-p. 6, l.
8; p. 9, ll. 9-15; p. 10, ll. 5-10; p. 11, l. 13-p. 13, l. 28; and p. 14, ll. 12-33) associated with a
document (16, Fig. 1, p. 4, ll. 18-30; p. 6, l. 9-p. 7, l. 14; p. 10, ll. 17-22), comprising:

a digital capture input device (18, Fig. 1, p. 4, ll. 18-30; p. 6, l. 9-p. 8, l. 21)
operatively associated with a network (22, Fig. 1, p. 5, l. 28-p. 6, l. 8; p. 8, l. 3-p. 9, l. 8),
said digital capture input device (18) converting (44, Fig. 2, p. 9, ll. 26-31) at least a
portion (17, Fig. 1, p. 4, ll. 18-30; p. 6, l. 26-p. 7, l. 14; p. 9, ll. 26-31) of said document
(16) to electronic format (16', Fig. 1, p. 4, ll. 18-30; p. 6, l. 26-p. 7, l. 14; p. 9, ll. 26-31;
p. 10, ll. 5-22; p. 14, ll. 5-33), the at least a portion (17) of said document (16) having one
or more indicia (20, Fig. 1, p. 4, ll. 18-30; p. 10, l. 11-p. 12, l. 6) thereon;

one or more computer readable storage media;

computer readable program code stored on said one or more computer readable
storage media, said computer readable program code comprising:

program code for analyzing (52, Fig. 2, p. 10, l. 11-p. 11, l. 12; p. 13, ll. 22-28)
the at least a portion (17) of said document (16) in electronic format (16') to obtain said
one or more indicia (20);

program code for using (54, Fig. 2, p. 11, l. 13-p. 12, l. 6) said one or more indicia
(20) to locate (48, Fig. 2, p. 10, ll. 2-5) said network data (14), said network data (14) not
including said document (16), said network data (14) being maintained at another device
operatively associated with the network (22); and

program code for automatically receiving (56, Fig. 2, p. 9, ll. 4-8; p. 12, l. 7-p. 13,
l. 11) said network data (14) at said digital capture input device (18).

(Claim 34) Apparatus for accessing network data (14, Fig. 1, p. 4, ll. 18-30; p. 5, l. 28-p. 6,
l. 8; p. 9, ll. 9-15; p. 10, ll. 5-10; p. 11, l. 13-p. 13, l. 28; and p. 14, ll. 12-33) associated with a
document (16, Fig. 1, p. 4, ll. 18-30; p. 6, l. 9-p. 7, l. 14; p. 10, ll. 17-22), comprising:

means for converting (44, Fig. 2, p. 9, ll. 26-31) at least a portion (17, Fig. 1, p.
4, ll. 18-30; p. 6, l. 26-p. 7, l. 14; p. 9, ll. 26-31) of said document (16) to electronic
format (16', Fig. 1, p. 4, ll. 18-30; p. 6, l. 26-p. 7, l. 14; p. 9, ll. 26-31; p. 10, ll. 5-22; p.
14, ll. 5-33), said means (44) being operatively associated with a network (22, Fig. 1, p.
5, l. 28-p. 6, l. 8; p. 8, l. 3-p. 9, l. 8), the at least a portion (17) of said document (16)
having means for locating (20, Fig. 1, p. 4, ll. 18-30; p. 10, l. 11-p. 12, l. 6) said network
data (14);

means for analyzing (52, Fig. 2, p. 10, l. 11-p. 11, l. 12; p. 13, ll. 22-28) the at
least a portion (17) of said document (16) in electronic format (16') to obtain said means
for locating (20) said network data (14);

means for using (54, Fig. 2, p. 11, l. 13-p. 12, l. 6) said means for locating (20)
said network data (14) to locate (48, Fig. 2, p. 10, ll. 2-5) said network data (14), said
network data (14) not including said document (16), said network data (14) being
maintained at another device connected to the network (22); and

means for automatically accessing (56, Fig. 2, p. 9, ll. 4-8; p. 12, l. 7-p. 13, l. 11)
said network data (14).
GROUNDS OF REJECTION TO BE REVIEWED ON APPEAL

1. Whether claims 1-7, 9-12, 14-22, 24-27, and 29-34 are unpatentable under 35
U.S.C. § 102(a) as being anticipated by Mitchell et al., U.S. Patent No. 5,693,966 (Mitchell).

2. Whether claims 8 and 23 are unpatentable under 35 U.S.C. § 103(a) as being
obvious over Mitchell.

3. Whether claims 13 and 28 are unpatentable under 35 U.S.C. § 103(a) as being
obvious over Mitchell in view of Block et al., U.S. Patent No. 6,295,543 (Block).
ARGUMENT

ISSUE 1: WHETHER CLAIMS 1-7, 9-12, 14-22, 24-27, AND 29-34 ARE
UNPATENTABLE UNDER 35 U.S.C. §102(a) AS BEING ANTICIPATED
BY MITCHELL ET AL., U.S. PATENT NO. 5,693,966 (MITCHELL).

Opening Statement:

Mitchell's hyperlinks do not comprise "network data" in the context of the pending
claims. They are merely pointers to other parts of the document itself. To the extent that the
examiner construes Mitchell's hyperlinks to comprise additional network data, that construction 
is not reasonable in the context of the teachings of Mitchell. At best, Mitchell's teachings in this 
regard are ambiguous, and an ambiguous reference cannot be used to support an anticipation 
rejection. Moreover, and even if Mitchell is unambiguous in this regard (which is denied), the 
fact that Mitchell fails to provide an enabling disclosure as to the type, identification, access, and 
retrieval of "network data" (in the language of the pending claims), means that Mitchell cannot 
support an anticipation rejection under Section 102. 

Legal Standard For Rejecting Claims 
Under 35 U.S.C. §102

The standard for lack of novelty, that is, for "anticipation," under 35 U.S.C. Section 102
is one of strict identity. "'Every element of the claimed invention must be identically shown in
a single reference.'" In re Bond, 910 F.2d 831, 832 (Fed. Cir. 1990) (quoting Diversitech Corp.
v. Century Steps, Inc., 850 F.2d 675, 677 (Fed. Cir. 1988)). "These elements must be arranged
as in the claim under review. . . ." In re Bond, 910 F.2d at 832. That is, "any degree of physical
difference, however slight, invalidates claims of anticipation." E.I. du Pont de Nemours & Co.
v. Polaroid Graphics Imaging Inc., 706 F. Supp. 1135, 1142 (D. Del.), aff'd without opinion, 887
F.2d 1095 (Fed. Cir. 1989). As the Board has explained, "It is well established that an
anticipation rejection cannot be predicated on an ambiguous reference. Rather, disclosures in a
reference relied on to prove anticipation must be so clear and explicit that those skilled in the art
will have no difficulty in ascertaining their meaning." Ex parte Allen, 2004 WL 4980908, *2
(Bd. Pat. App. & Interf. 2004) (citing In re Turlay, 304 F.2d 893, 899 (C.C.P.A. 1962)).

Functional limitations in apparatus claims may not be disregarded in evaluating
patentability. Ex parte Williams, 1997 WL 1935445, *2 (Bd. Pat. App. & Interf. 1997). Where
"limitations set forth a function . . . , the reference apparatus must be structurally capable of
performing . . . " that function. Id., 1997 WL 1935445, *2. Anticipation may be proved by
inherency; "[i]nherency, however, may not be established by probabilities or possibilities. The
mere fact that a certain thing may result from a given set of circumstances is not sufficient."
Continental Can Co. USA, Inc. v. Monsanto Co., 948 F.2d 1264, 1269 (Fed. Cir. 1991). Where
the examiner rejects an apparatus claim by arguing that a functional limitation is inherent in the
apparatus claimed, the Board has held that "it is well settled that an Examiner must provide some
evidence or scientific reasoning to establish the reasonableness of his belief that the functional
limitation in question is an inherent characteristic of the prior art." Ex parte Clarke, 2004 WL
77426, *3 (Bd. Pat. App. & Interf. 2004).

In determining whether the claim elements are present in a single reference, the "'claims
are to be given their broadest reasonable interpretation consistent with the specification ... as
it would be interpreted by one of ordinary skill in the art.'" In re American Academy of Science
Tech Center, 367 F.3d 1359, 1364 (Fed. Cir. 2004) (quoting In re Bond, 910 F.2d at 833). This
"interpretation must be consistent with the one that those skilled in the art would reach." In re
Cortright, 165 F.3d 1353, 1358 (Fed. Cir. 1999).

The Examiner's Rejections 
The examiner rejected claims 1-7, 9-12, 14-22, 24-27, and 29-34 under 35 U.S.C. § 102(a)
as being anticipated by Mitchell. These rejections are improper because Mitchell fails to disclose 
or suggest a method wherein indicia on a document is used to access additional information or 
"network data" (i.e., information that is not part of the document itself). 

Mitchell describes an automated document formatting system for converting paper
documents into electronic documents. While Mitchell's system is capable of using the document
itself to create an index of the various portions of the document, the index can only be used to 
retrieve other portions of the document. Significantly, the index cannot be used to retrieve 
additional information or "network data" as required by each of the rejected claims. Put in other 
words, the links that Mitchell generates as a result of the indexing process are confined to the 
document itself. By definition, the document itself cannot comprise "network data," where 
"network data" are defined to not include the document itself. In contrast to Mitchell, the present 
invention uses the indicia obtained from the document to retrieve additional information, i.e., 
"network data" that do not include the document itself. 

In response to this argument, the examiner states (in section 9 of the final office action)
that "hyperlinks within the document are considered as additional network data other than the
document." The examiner then cites to a parenthetical comment contained in col. 7, lines 51-53
of the Mitchell reference to support this contention:

"(links to other pages or HTML documents can also be inserted as needed for 
special formats)." 

This parenthetical comment is made at the end of the paragraph contained in col. 7, lines 42-53,
wherein Mitchell describes a method for coding of the pages of the document. The entire
paragraph is reproduced below:

"The method described encodes the document by forming an HTML (or 
SGML) page for each page in the source paper document. Each HTML page 
(typically a separate file, but multiple pages can be accommodated in a single file)
contains an image of a document page and hyperlinks to four other pages, 
including the index page containing the Table of Contents, the section page 
containing the beginning of the section, the previous page of the document, and 
the next page of the document (links to other pages or HTML documents can also 
be inserted as needed for special formats). FIG. 6 shows the resulting HTML 
page." 

It is not clear whether the "other . . . HTML documents" referred to in the parenthetical
comment means other HTML pages of the same document or HTML pages of a document other
than the one that is being converted into electronic form. Appellant reads the cited language as
referring to other HTML pages of the same document because that language occurs at the end of
the paragraph that describes the HTML coding of the document pages (i.e., of the same
document). While it might be possible to read the cited language in accord with the examiner's
construction, such a construction would be unreasonable in view of the other teachings of
Mitchell. That is, Mitchell contains no other teachings consistent with this construction.
Mitchell does not describe, even by example, what such other network data (other than the 
document) might be, much less how to use information (e.g., indicia in the language of the 
pending claims) to retrieve such network data. To the contrary, all of the teachings of Mitchell 
are directed to issues relating to the conversion of a paper document into electronic form. 

Therefore, it is the position of the Appellant that the only reasonable construction of the
parenthetical comment of Mitchell is that it refers to other HTML pages of the same document,
as opposed to HTML pages of a document other than the one that is being converted into
electronic form. At best, the teachings of Mitchell in this regard are ambiguous. However, as
the Board has explained,

"It is well established that an anticipation rejection cannot be predicated on an 
ambiguous reference. Rather, disclosures in a reference relied on to prove 
anticipation must be so clear and explicit that those skilled in the art will have 
no difficulty in ascertaining their meaning." Ex parte Allen, supra, (emphasis 
added).

Because Mitchell is ambiguous as to this point, Mitchell cannot be used to support an
anticipation rejection under Section 102.

In addition, even if Mitchell were unambiguous in this regard (i.e., that the hyperlinks are 
outside the document itself), Mitchell still would not anticipate the pending claims, because 
Mitchell is non-enabling as to whether hyperlinks could be used to access information outside 
the document itself. That is, Mitchell does not provide teachings, sufficient to a person having 
ordinary skill in the art, that would allow such a person to practice the invention without undue 
experimentation. Mitchell contains no further description of the nature of such data that would 
be "other than the document itself" (in the language of the pending claims), nor does Mitchell
provide any examples of such data, as does the present invention. As described on page 4, line
32 through page 5, line 2 of the present invention, network data in the context of the present
invention comprises additional information, such as, for example, "price, options, specifications,
coupons, purchase order forms, purchase incentives, company information, warranties, etc."
about the document. Clearly, Mitchell's hyperlinks are not this type of additional information
because Mitchell's hyperlinks are to the document itself. Of course, Mitchell provides no hint
as to how such data (other than the document itself) might be accessed and retrieved. 

Accordingly, even if Mitchell's teachings are unambiguous (which is denied), the fact that 
Mitchell's teachings are not enabling as to the type, identification, access, and retrieval of 
"network data" (in the language of the pending claims), means that Mitchell cannot support an 
anticipation rejection under Section 102. See, for example, Helifix Ltd. v. Blok-Lok, Ltd., 208
F.3d 1339, 1346 (Fed. Cir. 2000) (quoting In re Paulsen, 30 F.3d 1475, 1478-79 (Fed. Cir.
1994)):

"To be anticipating, a prior art reference must disclose 'each and every limitation
of the claimed invention[,] ... must be enabling[,] and [must] describe ... [the]
claimed invention sufficiently to have placed it in possession of a person of
ordinary skill in the field of the invention.'"

In summation, claims 1-7, 9-12, 14-22, 24-27, and 29-34 are not anticipated by Mitchell 
because Mitchell fails to disclose a method wherein indicia on a document is used to access 
additional information (i.e., information that is not part of the document itself). At best, the 
Mitchell reference is ambiguous in this regard, and an anticipation rejection cannot be predicated 
on an ambiguous reference. Even if Mitchell is deemed unambiguous in this regard, Mitchell still 
cannot anticipate the pending claims because Mitchell is not enabling.

ISSUE 2: WHETHER CLAIMS 8 AND 23 ARE UNPATENTABLE UNDER 35 U.S.C. 
§103(a) AS BEING OBVIOUS OVER MITCHELL. 

Legal Standard For Rejecting Claims 
Under 35 U.S.C. §103

The test for obviousness under 35 U.S.C. § 103 is whether the claimed invention would
have been obvious to those skilled in the art in light of the knowledge made available by the
reference or references. In re Donovan, 184 USPQ 414, 420, n. 3 (CCPA 1975). It requires
consideration of the entirety of the disclosures of the references. In re Rinehart, 189 USPQ 143,
146 (CCPA 1976). All limitations of the claims must be considered. In re Boe, 184 USPQ 38,
40 (CCPA 1974). In making a determination as to obviousness, the references must be read
without benefit of appellants' teachings. In re Meng, 181 USPQ 94, 97 (CCPA 1974). In
addition, the propriety of a Section 103 rejection is to be determined by whether the reference
teachings appear to be sufficient for one of ordinary skill in the relevant art having the references
before him to make the proposed substitution, combination, or other modifications. In re Lintner,
173 USPQ 560, 562 (CCPA 1972).
A basic mandate inherent in Section 103 is that a piecemeal reconstruction of prior art
patents shall not be the basis for a holding of obviousness. It is impermissible within the
framework of Section 103 to pick and choose from any one reference only so much of it as will
support a given position, to the exclusion of other parts necessary to the full appreciation of what
such reference fairly suggests to one of ordinary skill in the art. In re Kamm, 172 USPQ 298,
301-302 (CCPA 1972). Put somewhat differently, the fact that the inventions of the references
and of the appellants may be directed to concepts for solving the same problem does not serve
as a basis for arbitrarily choosing elements from references to attempt to fashion appellants'
claimed invention. In re Donovan, supra, at 420.

In the case of In re Wright, 6 USPQ2d 1959 (Fed. Cir. 1988) (restricted on other grounds
by In re Dillon, 16 USPQ2d 1897 (Fed. Cir. 1990)), the Court of Appeals for the Federal Circuit
decided that the Patent Office had improperly combined references which did not suggest the
properties and results of the appellants' invention nor suggest the claimed combination as a
solution to the problem which appellants' invention solved. The CAFC reached this conclusion
after an analysis of the prior case law, at p. 1961:

"We repeat the mandate of 35 U.S.C. § 103: it is the invention as a whole
that must be considered in obviousness determinations. The invention as a whole
embraces the structure, its properties, and the problem it solves. See, e.g., Cable
Electric Products, Inc. v. Genmark, Inc., 770 F.2d 1015, 1025, 226 USPQ 881,
886 (Fed. Cir. 1985) ("In evaluating obviousness, the hypothetical person of
ordinary skill in the pertinent art is presumed to have the 'ability to select and
utilize knowledge from other arts reasonably pertinent to [the] particular problem'
to which the invention is directed"), quoting In re Antle, 444 F.2d 1168, 1171-72,
170 USPQ 285, 287-88 (CCPA 1971); In re Antonie, 559 F.2d 618, 619, 195
USPQ 6, 8 (CCPA 1977) ("In delineating the invention as a whole, we look not
only at the claim in question... but also to those properties of the subject matter
which are inherent in the subject matter and are disclosed in the Specification")
(emphasis in original).

The determination of whether a novel structure is or is not "obvious"
requires cognizance of the properties of that structure and the problem which it
solves, viewed in light of the teachings of the prior art. See, e.g., In re Rinehart,
531 F.2d 1048, 1054, 189 USPQ 143, 149 (CCPA 1976) (the particular problem
facing the inventor must be considered in determining obviousness); see also
Lindemann Maschinenfabrik GmbH v. American Hoist and Derrick Co., 730 F.2d
1452, 1462, 221 USPQ 481, 488 (Fed. Cir. 1984) (it is error to focus "solely on
the product created, rather than on the obviousness or notoriousness of its
creation") (quoting General Motors Corp. v. U.S. Int'l Trade Comm'n, 687 F.2d
476, 483, 215 USPQ 484, 489 (CCPA 1982), cert. denied, 459 U.S. 1105 (1983)).

Thus the question is whether what the inventor did would have been
obvious to one of ordinary skill in the art attempting to solve the problem upon
which the inventor was working. Rinehart, 531 F.2d at 1054, 189 USPQ at 149;
see also In re Benno, 768 F.2d 1340, 1345, 226 USPQ 683, 687 (Fed. Cir. 1985)
("applicant's problem" and the prior art present different problems requiring
different solutions")."

More recently, the CAFC has reiterated the necessity that motivation be identified in 
choosing to combine prior art references for an obviousness type rejection. As stated by the 
Court of Appeals for the Federal Circuit in In re Rouffet, 47 USPQ2d 1453 (Fed. Cir. 1998) at 
1457: 

"[V]irtually all [inventions] are combinations of old elements."
Environmental Designs, Ltd. v. Union Oil Co., 713 F.2d 693, 698, 218 USPQ
865, 870 (Fed. Cir. 1983) ("Most, if not all, inventions are combinations and
mostly of old elements."). Therefore an examiner may often find every element
of a claimed invention in the prior art. If identification of each claimed element
in the prior art were sufficient to negate patentability, very few patents would ever
issue. Furthermore, rejecting patents solely by finding prior art corollaries for the
claimed elements would permit an examiner to use the claimed invention itself
as a blueprint for piecing together elements in the prior art to defeat the
patentability of the claimed invention. Such an approach would be "an illogical
and inappropriate process by which to determine patentability." Sensonics, Inc.
v. Aerosonic Corp., 81 F.3d 1566, 1570, 38 USPQ2d 1551, 1554 (Fed. Cir. 1996).

To prevent the use of hindsight based on the invention to defeat 
patentability of the invention, this court requires the examiner to show a 
motivation to combine the references that create the case of obviousness. In other 
words, the examiner must show reasons that the skilled artisan, confronted with 
the same problems as the inventor and with no knowledge of the claimed 
invention, would select the elements from the cited prior art references for 
combination in the manner claimed." 



A reference which teaches away from the appellants' invention may not properly be used
in framing a 35 U.S.C. §103 rejection of appellants' claims. See United States v. Adams, 148
USPQ 429 (1966).

The Examiner's Rejections 

The examiner rejected claims 8 and 23 under 35 U.S.C. § 103(a) as being obvious over 
Mitchell for the reasons stated in the final office action. The examiner's rejections are improper
in that it is not reasonable to view Mitchell's hyperlinks as corresponding to the "network data"
of the pending claims. Mitchell's hyperlinks are mere pointers to other places in the document
itself; they cannot be used to retrieve additional information relating to the document. 

Moreover, even if Mitchell's hyperlinks are regarded as network data, they clearly are not 
the type of network data contemplated by the present invention. That is, Mitchell's hyperlinks 
do not comprise any of the exemplary types of additional information or network data specified 
in the written description (e.g., price, options, specifications, coupons, purchase order forms, 
purchase incentives, company information, and warranties), nor does Mitchell even suggest that 
his hyperlinks might comprise such information. 

Consequently, because Mitchell's hyperlinks do not comprise "network data" in the 
context of the claimed invention, and because Mitchell nowhere discloses or suggests that they 
might comprise the type of additional information given by example in the written description, 
Mitchell cannot establish the required prima-facie case of obviousness. Therefore, claims 8 and 
23 are not obvious over Mitchell. 

ISSUE 3: WHETHER CLAIMS 13 AND 28 ARE UNPATENTABLE UNDER 35
U.S.C. §103(a) AS BEING OBVIOUS OVER MITCHELL IN VIEW OF
BLOCK ET AL., U.S. PATENT NO. 6,295,543 (BLOCK).

The Examiner's Rejections 
The examiner rejected claims 13 and 28 under 35 U.S.C. §103(a) as being obvious over
Mitchell in view of Block for the reasons stated in the final office action. The examiner's
rejections are improper in that neither Mitchell nor Block contains the suggestion or incentive
required to combine them in the manner required by the pending claims. Moreover, even if 
Mitchell and Block were combined, the resulting combination would still fail to meet the 
limitations of the pending claims. 

While the Block reference does disclose determining word frequency, Block uses word
frequency for a different purpose, namely, to empirically correlate a word and a class. However,
this is not the same as the limitations of claims 13 and 28, which include at least "using the results
of said frequency comparison to locate said network data." Notwithstanding this different
purpose, even if it were proper to combine Block with Mitchell (which is denied), the resulting 
combination would still fail to meet the limitations of the pending claims, in that it is not 
reasonable to view Mitchell's hyperlinks as corresponding to the "network data" of the pending 
claims. As discussed above, Mitchell's hyperlinks are mere pointers to other places in the 
document itself; they cannot be used to retrieve additional information relating to the document. 
The only information Mitchell's hyperlinks retrieve is other parts of the scanned document. 

In addition, even if Mitchell's hyperlinks are regarded as network data, they clearly are 
not the type of network data contemplated by the present invention. That is, Mitchell's 
hyperlinks do not comprise any of the exemplary types of additional information or network data 
specified in the written description (e.g., price, options, specifications, coupons, purchase order 
forms, purchase incentives, company information, and warranties), nor does Mitchell even 
suggest that his hyperlinks might comprise such information. Therefore, neither Mitchell nor 
Block can be used to establish the required prima-facie case of obviousness of claims 13 and 28. 
CONCLUSION 



Claims 1-7, 9-12, 14-22, 24-27, and 29-34 are not anticipated by Mitchell because 
Mitchell fails to disclose a method wherein indicia on a document is used to access additional 
information (i.e., information that is not part of the document itself). At best, the Mitchell 
reference is ambiguous in this regard, and an anticipation rejection cannot be predicated on an 
ambiguous reference. Even if Mitchell is deemed unambiguous in this regard, Mitchell still 
cannot anticipate the pending claims because Mitchell is not enabling. 

Claims 8 and 23 are allowable because Mitchell's hyperlinks are mere pointers to other 
places in the document itself. Mitchell's hyperlinks cannot be used to retrieve additional 
information relating to the document, and thus do not comprise "network data" as required by the 
pending claims. Claims 13 and 28 are not obvious over Mitchell in view of Block because 
neither reference provides the suggestion or incentive required to combine them in the manner 
required by claims 13 and 28. Therefore, Appellant respectfully requests the Board to reverse 
the rejections of claims 1-34. 




Bruce E. Dahl, Esq. 



Attorney for Appellant 
PTO Registration No. 33,670 
DAHL & CHETLIN, L.L.C. 
555 17th Street, Suite 3405 
Denver, CO 80202 
(303) 291-3200 



Date: [illegible] 






CLAIMS APPENDIX 



1. A method for accessing network data associated with a document, comprising: 

converting at least a portion of said document to electronic format with a digital 
capture input device, the at least a portion of said document having one or more indicia 
thereon, the digital capture input device being operatively associated with a network; 

analyzing the at least a portion of said document in electronic format to obtain 
said one or more indicia; 

using said one or more indicia to locate said network data, said network data not 
including said document, said network data being maintained at another device 
operatively associated with the network; and 

automatically accessing said network data. 

2. The method of claim 1, wherein said one or more indicia comprises at least a 
portion of the text on the at least a portion of said document. 

3. The method of claim 1, further comprising: 

providing the at least a portion of said document with one or more tags before the 
at least a portion of said document is converted to electronic format with said digital 
capture input device; 

analyzing the at least a portion of said document in electronic format to obtain 
said one or more tags; and 

using said one or more tags to locate said network data. 
4. The method of claim 3, wherein said one or more tags comprise machine-readable 
data. 

5. The method of claim 1, wherein accessing said network data comprises receiving 
said network data at said digital capture input device. 

6. The method of claim 5, further comprising displaying at least a portion of said 
network data on display apparatus operatively associated with said digital capture input device. 

7. The method of claim 5, further comprising printing at least a portion of said 
network data on printer apparatus operatively associated with said digital capture input device. 

8. The method of claim 1, wherein accessing said network data comprises receiving 
said network data at an email account. 

9. The method of claim 1, wherein accessing said network data comprises receiving 
said network data at a network device. 

10. The method of claim 1, wherein accessing said network data comprises displaying 
at least a portion of said network data. 

11. The method of claim 1, further comprising sending the at least a portion of said 
document in electronic format from said digital capture input device to said another device. 

12. The method of claim 1, wherein analyzing the at least a portion of said document 
in electronic format to obtain said one or more indicia comprises using character recognition to 
obtain said one or more indicia. 

13. The method of claim 1, wherein said one or more indicia comprises one or more 
words, and wherein using said one or more indicia to locate said network data comprises: 

determining a frequency for each of said one or more words; 

comparing the frequencies of said one or more words with a word frequency list; 
and 

using the results of said frequency comparison to locate said network data. 

14. The method of claim 1, wherein said digital capture input device is a multifunction 
device. 

15. Apparatus for accessing network data associated with a document, comprising: 

one or more computer readable storage media; and 

computer readable program code stored on said one or more computer readable 
storage media, said computer readable program code comprising: 

program code for analyzing at least a portion of said document to obtain one or 
more indicia from the at least a portion of said document after the at least a portion of said 
document has been converted to electronic format with a digital capture input device 
operatively associated with a network; 

program code for using said one or more indicia to locate said network data, said 
network data not including said document, said network data being maintained at another 
device operatively associated with the network; and 

program code for automatically accessing said network data. 

16. The apparatus of claim 15, wherein said digital capture input device is a 
multifunction device. 

17. The apparatus of claim 15, wherein said one or more indicia comprises at least a 
portion of the text on the at least a portion of said document. 

18. The apparatus of claim 15, further comprising one or more tags provided on the 
at least a portion of said document; and wherein said computer readable program code further 
comprises: 

program code for analyzing the at least a portion of said document in electronic 
format to obtain said one or more tags; and 

program code for using said one or more tags to locate said network data. 

19. The apparatus of claim 18, wherein said one or more tags comprise 
machine-readable data. 

20. The apparatus of claim 15, wherein said program code for accessing said network 
data comprises program code for receiving said network data at said digital capture input device. 

21. The apparatus of claim 20, wherein said computer readable program code further 
comprises program code for displaying at least a portion of said network data on display 
apparatus operatively associated with said digital capture input device. 

22. The apparatus of claim 20, wherein said computer readable program code further 
comprises program code for printing at least a portion of said network data on printer apparatus 
operatively associated with said digital capture input device. 

23. The apparatus of claim 15, wherein said program code for accessing said network 
data comprises program code for sending said network data to an email account. 

24. The apparatus of claim 15, wherein said program code for accessing said network 
data comprises program code for sending said network data to a network device. 

25. The apparatus of claim 15, wherein said program code for accessing said network 
data comprises program code for displaying at least a portion of said network data. 

26. The apparatus of claim 15, wherein said computer readable program code further 
comprises program code for sending the at least a portion of said document in electronic format 
from said digital input capture device to said another device. 

27. The apparatus of claim 15, wherein said program code for analyzing the at least 
a portion of said document in electronic format to obtain said one or more indicia comprises 
program code for using character recognition to obtain said one or more indicia. 

28. The apparatus of claim 15, wherein said one or more indicia comprises one or 
more words, and wherein said program code for using said one or more indicia to locate said 
network data comprises: 

program code for determining a frequency for each of said one or more words; 

program code for comparing the frequencies of said one or more words with a 
word frequency list; and 

program code for using the results of said frequency comparison to locate said 
network data. 

29. A system for accessing network data associated with a document, comprising: 

a digital capture input device operatively associated with a network, said digital 
capture input device converting at least a portion of said document to electronic format, 
the at least a portion of said document having one or more indicia thereon; 

one or more computer readable storage media; 

computer readable program code stored on said one or more computer readable 
storage media, said computer readable program code comprising: 

program code for analyzing the at least a portion of said document in electronic 
format to obtain said one or more indicia; 

program code for using said one or more indicia to locate said network data, said 
network data not including said document, said network data being maintained at another 
device operatively associated with the network; and 

program code for automatically receiving said network data at said digital capture 
input device. 

30. The system of claim 29, wherein said digital capture input device is a 
multifunction device. 

31. The system of claim 29, wherein said one or more indicia comprises at least a 
portion of the text on the at least a portion of said document. 

32. The system of claim 29, further comprising one or more tags provided on the at 
least a portion of said document; and wherein said computer readable program code further 
comprises: 

program code for analyzing the at least a portion of said document in electronic 
format to obtain said one or more tags; and 

program code for using said one or more tags to locate said network data. 

33. The system of claim 32, wherein said one or more tags comprise machine-readable 
data. 

34. Apparatus for accessing network data associated with a document, comprising: 

means for converting at least a portion of said document to electronic format, said 
means being operatively associated with a network, the at least a portion of said document 
having means for locating said network data; 

means for analyzing the at least a portion of said document in electronic format 
to obtain said means for locating said network data; 

means for using said means for locating said network data to locate said network 
data, said network data not including said document, said network data being maintained 
at another device connected to the network; and 

means for automatically accessing said network data. 
References Relied on By Examiner in Final Office Action. 

Copies of the following references are attached hereto for the Board's convenience: 

1. U.S. Patent No. 5,693,966, "Automated Capture of Technical Documents for Electronic 
Review and Distribution," of Mitchell et al. 

2. U.S. Patent No. 6,295,543, "Method of Automatically Classifying a Text Appearing in 
a Document when said Text has been Converted into Digital Data," of Block et al. 
[57] ABSTRACT 

Paper documents are automatically converted into a 
hypertext-based format so that they can be accessed through 
electronic networks, including the Internet, or via non- 
volatile transfer media such as disks or CD-ROMs. The 
invention generalizes the concept of form-based recognition 
while extending the concept of document retrieval to include 
document structure knowledge, thereby providing the 
advantages found in both form-based recognition 
(utilization of document structure knowledge) and image- 
based information retrieval (robustness). In a preferred 
embodiment, a method according to the invention enables 
direct translation of a paper document into a hypertext-based 
format so that it may be directly accessed through the 
Internet using current browsers such as Mosaic, Netscape 
and Microsoft's Explorer. 
Part 3. Phase I Work Plan 

The TM-to-IETM conversion task can be decomposed into several subproblems 
which include the following: Text Recognition (or OCR); Recognition of References 
and the creation of the appropriate pointers; the Conversion of Graphical Figures into 
electronic format; Understanding Table formats; Identification and Conversion of 
Procedural Data; and the Understanding of Warnings, Cautions, and Notes found 
within a TM. It is our intent to evaluate the potential for automating each of these 
components of the translation process. 

Because text recognition plays a central part of the entire TM translation process 
(text, table cells, reference pointers, procedural data steps, and the body of warnings, 
cautions, and notes) its performance is especially important for ETM conversion. 
Textual word translation can be improved by using known word lexicons to transform 
the recognition from character level to word level. It is important to note however that 
this technique cannot be used to improve number recognition, since all possible 
combinations of numbers are legal, and thus other contextual information must be 
used (e.g. known number sequences, table lists, etc.). As the developers of an 
acknowledged high performance numeric character recognizer, 
we have developed techniques that can be used to increase numeric recognition 
performance, and intend to apply those techniques to this portion of the problem. 



References can occur throughout the TM to provide pointers to the following: models 
or types, government specifications and standards, temperature readings, instrument 
readings, switch positions and panel markings, U.S. standard unit measurements, 
figure numbers, figure numbers followed by an index number, parts on diagrams, 
tables, other paragraphs in the same manual, other subordinate paragraphs, other 
TM identification numbers, footnotes, series of items, or data applicable to a 
sentence or paragraph. The references contained within a TM are generally textual 
by nature, and as such fall within the text recognition problem. In general, however, 
reference recognition is more difficult than simple text recognition because it often 
lacks any contextual information. There is a great deal of redundancy contained 
#!/bin/sh 

echo Content-TYPE: text/html 
echo 

if [ $# = 0 ] # is the number of arguments = 0? 
then # do this part if there are NO arguments 

echo "<HEAD>" 
echo "<TITLE>Cybernet Proposal Search</TITLE>" 
echo "<ISINDEX>" 
echo "</HEAD>" 
echo "<BODY>" 
echo "<H1>Cybernet Proposal Search</H1>" 
echo "Enter your search in the search field. <P>" 
echo "This is a case insensitive substring search: thus" 
echo "searching for 'cat' will find 'Cat' and 'CATEGORY'." 
echo "</BODY>" 

else # this part if there ARE arguments 

echo "<HEAD>" 
echo "<TITLE>Result of search for \"$*\".</TITLE>" 
echo "</HEAD>" 
echo "<BODY>" 
echo "<H1>Result of search for \"$*\".</H1>" 
#echo "<PRE>" 
for i in $* 
do 
# search every page text file, then pass the matches to the formatter 
grep -i $i /extern1/people/ganz/public_html/pages_txt/page*_op.txt | 
/usr/local/etc/httpd/cgi-bin/format_grep 
done 
#echo "</PRE>" 
echo "</BODY>" 
fi 
Result of search for "vector". 

output vector that exists in within the recognizer before some decision 
technique 

vector is constructed which contains the area in the image of each 
possible "characteristic 

segments from the thinned text. They encode this information into fixed 
length vectors and 
AUTOMATED CAPTURE OF TECHNICAL 
DOCUMENTS FOR ELECTRONIC REVIEW 
AND DISTRIBUTION 

REFERENCE TO RELATED APPLICATION 

This application claims priority of U.S. provisional 
application Ser. No. 60/006,372, filed Nov. 8, 1995, the entire 
contents of which are incorporated herein by reference. 

FIELD OF THE INVENTION 

The present invention relates generally to document 
processing, and, in particular, to automated document con- 
version to digital form for broadcast over networks, includ- 
ing the World Wide Web, and/or transfer media such as disks 
or CD-ROM. 

BACKGROUND OF THE INVENTION 

Two major components of document conversion are page 
decomposition and text recognition (OCR). Page decompo- 
sition identifies the overall layout of a document page, 
whereas OCR identifies the ASCII components found within 
the page components. 

A page decomposition or segmentation module accepts an 
input document page, and processes it into its constituent 
parts, including text, tables, references, procedural data, and 
graphics. Because page decomposition occurs first in the 
processing chain, it is one of the most important modules of 
any automatic document understanding system. It is very 
difficult to recover from any errors that occur at this stage of 
processing, and as such it is important that a very reliable 
page decomposition module be developed. 

Most page segmentation methods can be classified into 
three broad categories: bottom-up, top-down and hybrid. 
The bottom-up strategies usually begin with the connected 
components of the image and merge them into larger and 
larger regions. The components are merged into words, 
words into lines, lines into columns, etc., until the entire 
page is completely assembled. In the top-down approaches, 
the page is first split into blocks, and these blocks are 
identified and subdivided appropriately, often using 
projection profiles. The hybrid methods combine aspects of both 
the top-down and bottom-up approaches. The page can be 
roughly segmented by a sequence of horizontal and vertical 
projections and then connectivity analysis is used to 
complete the segmentation. 
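
The top-down splitting described above can be illustrated with a minimal sketch. It assumes the page is a binary image held as nested lists (1 for ink, 0 for background); the function names and the `min_gap` parameter are hypothetical and do not come from the reference. The page is first cut into horizontal bands at empty rows, and each band is then cut at empty columns.

```python
from typing import List, Tuple

Page = List[List[int]]  # assumed representation: 1 = ink, 0 = background


def split_on_gaps(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) ranges of inked runs separated by at least
    `min_gap` empty positions; `end` is exclusive."""
    blocks, start, gap = [], None, 0
    for i, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                blocks.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        blocks.append((start, len(profile) - gap))
    return blocks


def horizontal_profile(page: Page) -> List[int]:
    """Ink count per row (projection onto the vertical axis)."""
    return [sum(row) for row in page]


def vertical_profile(page: Page, top: int, bottom: int) -> List[int]:
    """Ink count per column, restricted to rows top..bottom-1."""
    return [sum(page[r][c] for r in range(top, bottom))
            for c in range(len(page[0]))]


def segment(page: Page, min_gap: int = 1) -> List[Tuple[int, int, int, int]]:
    """Split the page into bands, then each band into blocks.
    Each region is (top, bottom, left, right) with exclusive ends."""
    regions = []
    for top, bottom in split_on_gaps(horizontal_profile(page), min_gap):
        for left, right in split_on_gaps(vertical_profile(page, top, bottom), min_gap):
            regions.append((top, bottom, left, right))
    return regions


page = [
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
regions = segment(page)
```

A hybrid method in the sense of the passage would follow such a rough split with connectivity analysis inside each region; that step is not shown here.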

Existing document conversion techniques include: 1) 
OCR (optical character recognition) document conversion, 
2) form-based OCR, and 3) combined image and OCR 
systems. The first technique understands and translates the 
document into its individual components. The second 
utilizes spatial and content constraints of document forms to 
increase OCR reliability. The third leaves the document in 
its image form, but also utilizes OCRed text to provide 
indexing into the image database. 

An OCR document conversion system which translates 
document images into ASCII and graphics components is 
illustrated in FIG. 1. The approach begins with page 
decomposition, wherein each page is segmented into graphic 
and text regions. Algorithms to process these regions are 
incorporated into the system. Graphical operations include 
extracting text from within graphics and raster-to-vector 
conversion. OCR is computed for the text regions, as well as 
any text located within the graphical regions. An integration 
step combines the results of the graphical and textual 
processes into a final electronic format. 
The ASCII and graphics format obtained in this system is 
the only record of the document which is stored, as no 
images of the original document are kept. Since any errors 
introduced during page decomposition are propagated 
forward in the system, it is important that the graphical and 
textual regions of the page be correctly identified. Thus, the 
automatic regions generated must be manually checked or 
the page must be manually separated into regions. It is also 
important that the text be correctly recognized to ensure that 
the document is correctly captured. This requires that the 
OCR results be manually scanned and any errors corrected. 
Similarly, the text within graphics must be correctly 
extracted and recognized, again requiring manual checking. 
This manual checking can involve large cost which is often 
prohibitive for many applications. 

Form-based OCR systems use the spatial and content 
constraints found in document forms to increase OCR 
conversion performance. According to this approach, a 
"form" may be defined as a document containing data 
written in fields that are spatially stable on the document. All 
form processing systems require the creation of a template 
showing the system what a particular form looks like and 
where to find the fields to read. 

Form identification decides which master form was used 
in a given image, and passes that information to the form 
removal functions. Form removal strips all standard data 
from the scanned forms, including lines, instructions and 
examples, leaving only the information entered into the form 
by the applicant. The space necessary to store images is 
reduced by allowing users to save stripped forms and later 
add back the master form data before displaying or printing. 

Full function form processors, in addition to machine 
print, also read hand printed characters and optical marks 
such as checkboxes. Form processing systems offer a broad 
array of tools, some standardized and some custom, which 
exploit information specific to a particular form type for 
error detection or correction. Restrictions can be applied to 
fields to increase recognition accuracy. For example, the 
field masking tool permits individual characters within a 
field to be recognized exclusively as either an alpha 
character or a numeric character. Thus, a social security number 
field can be required to contain only nine numeric characters 
and possibly two dashes. Segmentation of the field is then 
limited to 9 or 11 characters and recognition is limited to 10 
digits and 1 dash. Additional tools include external table 
lookups, where field contents can be compared to a table of 
possible responses and the best match selected; checks of 
digit computations, where errors are detected by performing 
a mathematical operation on a field's contents and 
comparing the result to a predetermined total contained in the same 
field; and range limits, where the number in the field is 
checked to determine if it is within a valid range. 
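
The field-level tools listed above can be sketched as follows. This is an illustration under stated assumptions, not the method of any particular form processor: the function names are hypothetical, the table lookup uses Python's standard `difflib` similarity as a stand-in for "best match", and the check-digit computation uses the well-known Luhn scheme only because the passage does not name one.

```python
import difflib
import re
from typing import List, Optional

# Field mask: nine digits, written with no dashes or with exactly two dashes
# (9 or 11 characters, drawn only from the ten digits and the dash).
SSN_MASK = re.compile(r"\d{9}|\d{3}-\d{2}-\d{4}")


def valid_ssn_field(text: str) -> bool:
    """Accept the field only if it satisfies the mask."""
    return SSN_MASK.fullmatch(text) is not None


def best_table_match(text: str, table: List[str], cutoff: float = 0.6) -> Optional[str]:
    """External table lookup: return the closest allowed response, if any.
    The 0.6 cutoff is an assumed value."""
    matches = difflib.get_close_matches(text, table, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def luhn_valid(digits: str) -> bool:
    """Check-digit computation (Luhn, assumed for illustration): the last
    digit must make the weighted digit sum a multiple of 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def in_range(value: int, low: int, high: int) -> bool:
    """Range limit: the recognized number must lie within [low, high]."""
    return low <= value <= high


states = ["Michigan", "Ohio", "Texas"]
corrected = best_table_match("M1chigan", states)  # OCR read "i" as "1"
```

A field that fails one of these checks would typically be flagged for manual review rather than silently corrected; the table lookup is the only tool here that proposes a replacement value.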

Unlike generic OCR systems, which only allow the error 
tolerance level to be set globally, form processing systems 
permit error tolerance levels (confidence levels) to be set on 
a field level. Thus, important fields, where field importance 
differs based on the form, can have their confidence levels 
set to a threshold requiring higher OCR accuracy. 

Combined OCR and image systems utilize both data 
sources to provide robust retrieval, as illustrated in FIG. 2. 
The approach does not convert the document page directly 
to ASCII and graphics format, but rather saves a bit-mapped 
image of the document page and devises indexing schemes 
to retrieve specific pages. When a page is retrieved, 
operations can then be performed on the areas of interest. For 
example, OCR can be computed on selected areas, or 
graphics can be extracted. 
Once the text regions of a document have been identified, 
the characters within the regions need to be recognized. 
There are many different approaches to character 
recognition, but they can be generally grouped into two main 
categories: template-based methods and feature-based meth- 
ods. Template methods maintain a collection of sample 
letters and identify a component in question by finding the 
closest-matching template. Feature methods, on the other 
hand, try to break the component into a collection of 
"features" by identifying where strokes join or curve sig- 
nificantly. 

The classic template solutions compare each component 
to a collection of models representing all possible letters in 
all possible fonts. Thus, templates must be created for each 
of the different fonts. Contrarily, feature based recognition 
algorithms need not be tuned to individual typefaces, 
because they are based on finding characteristic features of 
each letter. For example, regardless of the typeface, a 
lowercase "t" consists of a strong vertical stroke crossed 
with a horizontal stroke. Thus, the feature based methods 
attempt to find this essence of the letter. 

Each of the OCR techniques has its benefits and 
shortcomings. Combining the various methods in a voting 
scheme can overcome the limitations of each of the indi- 
vidual methods. In a voting scheme, the results of each of the 
OCR modules are passed to a decision module to determine 
a final recognition result. Since the decision module has 
knowledge about each of the OCR modules, it can determine 
the best possible answer. 

The decision module can keep track of the character 
results and which OCR methods presented the correct 
response to the decision module. For example, if three 
methods report that the input character is a "B" and one 
method decides the character is an "8", the decision module 
will likely choose "B" as the best result. Further, the module 
that made the mistake will be noted for the next time. This 
adaptive learning approach allows the system to learn from 
its mistakes. 
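
A minimal sketch of such a decision module follows. The class name, the weighted-vote rule, and the 0.9 penalty factor are assumptions made for illustration; the passage says only that the module picks the best result and notes which recognizer erred.

```python
from collections import defaultdict
from typing import Dict


class VotingDecider:
    """Combine character hypotheses from several OCR modules."""

    def __init__(self, recognizers):
        # Every recognizer starts with the same trust weight.
        self.weights: Dict[str, float] = {name: 1.0 for name in recognizers}

    def decide(self, votes: Dict[str, str]) -> str:
        """votes maps recognizer name -> recognized character.
        The character with the largest total weight wins."""
        scores = defaultdict(float)
        for name, char in votes.items():
            scores[char] += self.weights[name]
        return max(scores, key=scores.get)

    def learn(self, votes: Dict[str, str], truth: str, penalty: float = 0.9) -> None:
        """Note which recognizers were wrong by shrinking their weight
        (assumed update rule), so they count for less next time."""
        for name, char in votes.items():
            if char != truth:
                self.weights[name] *= penalty


decider = VotingDecider(["ocr_a", "ocr_b", "ocr_c", "ocr_d"])
votes = {"ocr_a": "B", "ocr_b": "B", "ocr_c": "B", "ocr_d": "8"}
result = decider.decide(votes)   # three votes for "B" outweigh one for "8"
decider.learn(votes, truth=result)
```

Here the winning character is used as the truth for the update, which is one possible reading of "learning from mistakes"; a system with manually verified ground truth could pass that instead.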

It is important to note that voting systems perform best 
when the hypotheses from the OCR systems are of high 
accuracy. When a text region is degraded and difficult to 
read, there is usually much disagreement among the 
recognizers, which is difficult for a voting system to resolve. 

Each year, the Information Science Research Institute 
(ISRI) at the University of Nevada, Las Vegas (UNLV) 
conducts a test of the performance of various OCR systems, 
many of which are commercially available. Although 
recently tested OCR systems do not quite reach 100%, 
current recognition rates are impressive and improvement is 
ongoing. Achieving the last few percent is always the most 
difficult part, but OCR developers are steadily increasing 
their performance. With the incorporation of a voting 
scheme, the recognition rates increase even more. 

If the OCR generated text is to be used in a text retrieval 
application, the percentage of words correctly recognized by 
the OCR system is of considerable interest. In a text retrieval 
system, the documents are retrieved from a database by 
matching search terms with words in the document. Thus, 
the word accuracy of the OCR-generated text is very 
important. Common words, such as "and," "of," "the," etc., 
usually provide no retrieval value in an indexing system. 
These words are termed stop words, and all other words are 
termed non-stop words. It is the recognition rate of these 
non-stop words that is of greatest importance to text retrieval 
applications. 
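
The non-stop word recognition rate can be made concrete with a short sketch. The stop word list is an assumed sample, and the comparison assumes the OCR output has the same number of words in the same order as the reference text, which real OCR output does not guarantee.

```python
from typing import List

STOP_WORDS = {"and", "of", "the", "a", "in", "to"}  # assumed sample list


def normalize(text: str) -> List[str]:
    """Lowercase each word and strip surrounding punctuation."""
    return [w.strip(".,;:\"'()").lower() for w in text.split()]


def non_stop_word_accuracy(truth: str, ocr: str) -> float:
    """Fraction of non-stop words in `truth` that OCR reproduced exactly.
    Words are paired by position (assumes equal word counts)."""
    pairs = zip(normalize(truth), normalize(ocr))
    scored = [(t, o) for t, o in pairs if t not in STOP_WORDS]
    if not scored:
        return 1.0
    return sum(t == o for t, o in scored) / len(scored)


truth = "The recognition of the vector"
ocr = "Tbe recognition of tke vect0r"
rate = non_stop_word_accuracy(truth, ocr)
```

In this example three of the five words are misread, but only one of the errors falls on a non-stop word, so the rate that matters for retrieval is one correct word out of two.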

If OCR is to be used as a conversion process to input 
technical manuals into ASCII form, manual checking and 
correcting of the OCR of the text will be necessary. 
Assuming an OCR character accuracy rate of 99%, a page with 
4000 characters would result in 40 character errors per page. 
The issue becomes the cost of this manual correction versus 
the effectiveness of OCR, i.e. is it cheaper to correct the 
OCRed version or simply retype it? 
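
The error figure above follows from a one-line calculation, shown here as a small sketch (the function name is hypothetical):

```python
def expected_errors(chars_per_page: int, accuracy: float) -> float:
    """Expected character errors on a page at a given character accuracy."""
    return chars_per_page * (1.0 - accuracy)


# 4000 characters at 99% accuracy leaves about 40 errors per page.
errors = expected_errors(4000, 0.99)
```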

To answer this question, we conducted a test using the 
OmniPage Professional OCR product to determine the time 
needed to correct an OCRed document versus the time 
needed to retype the document. It was assumed that the 
documents can be scanned and OCRed in a batch mode with 
the results saved to a file for future manual correction. Thus, 
only the actual labor costs are measured, not any time spent 
scanning and recognizing the document. 

Seven pages from various documents were scanned and 
OCRed. The pages chosen for this test were quite simple, but 
included different fonts, and bold and italic characters. They 
contained single columns of text, no graphics and very few 
underlined sentences, since underlines tended to present a 
problem to the OmniPage recognizer. A bibliography page 
was also included in the set to introduce digits (from dates 
and page numbers) and proper nouns (author's names) 
which cannot be automatically corrected by dictionary 
lookups. 

OmniPage offers a method to check its OCR results. Any 
characters that the system has a difficult time recognizing are 
highlighted and the original image of the word in the context 
of the original page is presented to the user for possible 
correction. This method does not flag all OCR errors and 
presents numerous correct characters for viewing. Thus, this 
process was not incorporated in our timing test. The person 
correcting the text did not use this feature of the OmniPage 
system, but was allowed to use Microsoft Word spell 
checker to flag possible misspellings for correction. 

Each OCRed page was manually corrected, and the 
correction time recorded. It took approximately 56 minutes to 
correct the seven pages. Assuming a typist can type 50 wpm, 
the time to retype these pages is 74 minutes or about 62 
minutes at 60 wpm. From these numbers, it appears that 
OCRing the documents may be slightly more beneficial. 
However, a closer review of the manual corrections is 
needed. 
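
The timing comparison can be checked with a short calculation. The total word count is not stated in the text; the figure of 3700 words used below is inferred from the stated 74 minutes at 50 wpm and should be read as an assumption.

```python
def retype_minutes(words: int, wpm: float) -> float:
    """Minutes needed to retype `words` words at `wpm` words per minute."""
    return words / wpm


CORRECTION_MINUTES = 56      # measured time to correct the seven OCRed pages
WORDS = 74 * 50              # 3700 words, implied by 74 minutes at 50 wpm

at_50 = retype_minutes(WORDS, 50)   # 74.0 minutes
at_60 = retype_minutes(WORDS, 60)   # about 61.7 minutes, i.e. roughly 62
saving_vs_60 = at_60 - CORRECTION_MINUTES
```

On these numbers, correcting the OCR output saves only about six minutes over a 60 wpm typist across seven pages, which is why the passage calls the benefit slight.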

The typical OCR errors include character omissions, 
additions and substitutions, bold and italic typeface errors, 
and incorrect spacing. Most of the missed errors were words 
recognized as bold typeface which were not bold in the 
original documents. The bibliography page (page five in the 
tables above) proved to be quite a challenge for the OCR 
system with fifty detected errors. The page was included 
because of its intermingling of digits and characters, and its 
inclusion of proper names and acronyms. This type of text 
must be carefully reviewed for errors. It is not like regular 
paragraphs where the corrector can simply read the flow of 
the sentences to check it. Dates, page numbers, and author's 
names must be carefully checked. Indeed, although the time 
required for retyping and manually correcting the pages in 
our test set were similar, the manual correction stage still left 
many errors uncorrected. Depending on the accuracy 
required, each page may need to be corrected by more than 
one person, thus doubling the time of manual correction. 

The results of this experiment confirmed that OCR 
technology cannot be used to convert documents (either 
automatically or semi-automatically) in a cost-effective manner. 
More cost-effective methods are desperately needed, 
however, to convert existing large-scale, paper-document 
data bases into electronic form. Within the U.S. Government 
community, for example, reauthoring technical manuals into 
hypertext format costs between $200 and $1500 per page. 

The use of hypertext documents has proven to be a 
cost-effective tool for supporting military equipment maintenance 
through the Department of Defense (DoD) Computer-aided 
Acquisition and Logistic Support (CALS) program. In this 
program, a hypertext format (IETM) was used for storing 
textual, graphical, audio, or video data in a revisable 
database. The IETM form enables the electronic data user to 
locate information easily, and to present it faster, more 
comprehensibly, more specifically matched to the 
configuration, and in a form that requires much less storage 
than paper. Power troubleshooting procedures not possible 
with paper Technical Manuals are possible using the 
computational capability of the IETM Display Device. 

At the center of the IETM concept is the Interactive 
Electronic Technical Manual DataBase (IETMDB). This 
data structure is constructed from composite nodes which 
form the basic units of information within the IETMDB. 
These nodes are comprised of primitives, relationships to 
other pieces of information, and context attributes. The 
primitives include text, tables, graphics, and dialogs. The
IETMDB is "format-free" in that it does not contain pre- 
sentation information. As such, it does not impose structural 
requirements on the actual Data Base Management System 
(DBMS) methodology in use. 

In summary, a hypertext-based approach to document 
conversion has potential for large-scale projects. However, 
in order to serve a greater technical and digital library 
community, existing hypertext approaches will need to be
extended to include more general encoding, revising, and
distribution capabilities applicable to electronic technical 
data and documents. 

SUMMARY OF THE INVENTION 

Broadly, the present invention provides a method for 
converting paper documents into a hypertext-based format 
so that they can be accessed through networks such as the 
Internet or on media such as disk or CD-ROM. The method 
eliminates many of the errors normally associated with 
existing document conversion in a very cost effective 
manner, thus providing a low-cost, high-performance solu- 
tion for converting existing paper-based databases into a 
form that can be accessed through the information highway.

More particularly, the method generalizes the concept of 
form-based recognition while extending the concept of 
document retrieval to include document structure 
knowledge, thereby providing the advantages found in both 
form-based recognition (utilization of document structure 
knowledge) and image-based information retrieval 
(robustness). 

In one embodiment, the process utilizes the SGML format
as a primary translation target to leverage ongoing work in
developing this data format standard and tools. SGML is
also the required markup language for the CALS text files as
established by DoD standard MIL-M-28001, and thus, this
design can be used to convert technical documents to this
military standard. The invention also enables direct transla-
tion into HTML format (a subset of SGML), and thus
provides a mechanism for translating documents into a
format that can be accessed through the Internet using
current browsers like Mosaic or Netscape. Accordingly, the
method can be used to create electronic documents for a
rapidly increasing Internet user population, which is cur-
rently estimated to be growing at a rate of over 70,000 new
users each month.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram which represents a classic OCR 
approach to document conversion; 

FIG. 2 is a diagram which represents a combined image
and OCRed text approach to document conversion;

FIG. 3 is an overview of a system design according to the 
invention; 

FIG. 4 is a screen display of an HTML image-based
document; 

FIG. 5 is a diagram illustrating an HTML generator 
according to the invention; 

FIG. 6 is an HTML page for the system of FIG. 5;

FIG. 7 is a screen display of a Mosaic page generated
according to the invention; 

FIG. 8 is a file map for the pages generated according to 
FIGS. 6 and 7; 

FIG. 9 is a section of text which shows words associated 
with hyperlinks stored as reverse video; 

FIG. 10 is a shell program to perform search and display 
results; 

FIG. 11 shows results of a search for the word "vector";

FIG. 12 is a scan of text at 300 dpi;

FIG. 13 is a scan of text at 72 dpi;

FIG. 14 is a screen capture of a 72 dpi image; and

FIG. 15 is a stripe decomposition with 3 layers and 3
stripes.

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENTS 

The present invention resides in a generalized OCR
engine which extends techniques found in form-based OCR
and combined-image-text OCR technologies. An overview
of the method and corresponding applications is illustrated in
FIG. 3, and includes two primary components: 1) translation
from paper to electronic form (SGML, HTML, text), and 2)
a user application for reading these electronic documents. As
also seen in the figure, the translation component is built
from a generalized OCR engine and an authoring environ-
ment.

The generalized OCR engine according to this invention
combines desirable features found in the form-based recog-
nition systems and image-retrieval systems to produce a
solution that is both cost-effective (in its solution) and
elegant (in its generality) by generalizing the concept of
form-based recognition to include both logical (document
structure) and physical (spatial location) information.

The authoring environment provides a mechanism not 
only for controlling the document conversion process 
overall, but for adding to documents, once converted. As it 
is inevitable that the converted documents will need to be
modified or extended, the complete document conversion 
process may be embedded within a multimedia editor which 
combines imaging, word processing, and multimedia tools 
in support of document extension. Conversion is only the 
first step. Active use of the resulting electronic documents is 
essential for sustaining value for many of the targeted 
databases. This approach is essential for maximizing the 
overall value of the document conversion system. 

The user application (browser) component allows the user 
to access the resultant document using several different 
potential user applications, depending on the form of the 
electronic document. One example is an Internet Browser. 
By translating the document into HTML format (or an
extended version of HTML), Internet users may easily
access the translated documents regardless of geography.
This is a powerful idea, in that it eliminates existing bound-
aries which limit the use of the knowledge contained within
many existing paper-based databases.

Prototype HTML Document Format Description

To demonstrate the potential of creating Internet
documents, a prototype HTML document based on both
images and OCRed text was implemented. A representative
document, scanned from its paper form into an HTML
document, is viewable at the following World Wide Web
address: http://www.cybernet.com/~ganz/ietm_index.html
using an HTML browser such as Mosaic, MS Explorer or
Netscape. A simple search routine has also been
implemented, which allows the document to be searched
using the dirty OCRed text. A screen shot of this prototype
document is illustrated in FIG. 4. Note that the Table of
Contents found within this figure contains items that are
hyperlinked (using document structure knowledge) to vari-
ous document images that can be brought into the viewer.
Also note that the slider bar can be used to move sequen-
tially through the document in a manner similar to reading
a paper document from start to finish.

In developing this prototype, a demonstration document
conversion method was also developed wherein the pages of
the paper document were scanned and OCRed. Any errors in
the automatic zone detection were corrected, so that all text
and graphics were correctly separated and labeled. However,
no effort was made to correct the resulting OCR, so the
text database was uncorrected or "dirty". FIG. 5 provides an
overview of the process used in developing this prototype.

As illustrated in the figure by dashed lines, the parsing of 
the sample paper document into a table of contents and list 
of figures was done manually, although this could also be 
automated. An HTML page was created with hyperlinks to
connect each item in the Table of Contents and List of 
Figures with its corresponding HTML document page. A 
document understanding algorithm using document struc- 
ture knowledge is preferably used to parse the text and create 
the Table of Contents, List of Figures and appropriate
hyperlinks. 

Method for HTML Coding of Pages 

The method described encodes the document by forming 
an HTML (or SGML) page for each page in the source paper 
document. Each HTML page (typically a separate file, but
multiple pages can be accommodated in a single file) 
contains an image of a document page and hyperlinks to four 
other pages, including the index page containing the Table of 
Contents, the section page containing the beginning of the 
section, the previous page of the document, and the next
page of the document (links to other pages or HTML 
documents can also be inserted as needed for special 
formats). FIG. 6 shows the resulting HTML page. 
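
The per-page encoding described above can be sketched as a small generator. This is a minimal illustration only, not the page of FIG. 6: the function name, the file names, and the pixel dimensions are hypothetical, and the link labels are chosen for readability.

```python
from html import escape


def page_html(title, image_src, width, height, index, up, previous, next_page):
    """Build one HTML page holding a page image and four navigation links."""
    nav = (
        f'<A HREF="{index}">[Index]</A> <A HREF="{up}">[Up]</A> '
        f'<A HREF="{previous}">[Previous]</A> <A HREF="{next_page}">[Next]</A>'
    )
    return "\n".join([
        "<HTML>",
        f"<HEAD><TITLE>{escape(title)}</TITLE></HEAD>",
        "<BODY>",
        nav,  # links above the page image
        f'<P><IMG SRC="{image_src}" WIDTH="{width}" HEIGHT="{height}"></P>',
        nav,  # the same links repeated below the image
        "</BODY>",
        "</HTML>",
    ])


# Hypothetical file names and pixel dimensions for one page of a document.
html = page_html("Page 6", "page06.gif", 576, 745,
                 "index.html", "section2.html", "page05.html", "page07.html")
print(html)
```

Generating one such file per scanned page, with the previous and next targets taken from the page sequence, yields a document that can be read from start to finish or entered from the index.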

The display generated by the HTML page on a Mosaic 
browser is illustrated in FIG. 7. As seen in the figure, the
words between <TITLE> and </TITLE> in the HTML page
appear in the title section of the Mosaic browser. The
<BODY> section of the HTML page creates the four buttons
[Index] [Up] [Previous] and [Next] seen in the Mosaic
display. It also establishes the hyperlinks between the but-
tons and the specific proposal page to be viewed when a
button is selected. For convenience, these buttons appear at
both the top and bottom of the displayed page. 

The <BODY> section of the HTML page also identifies 
the page image to be displayed when this HTML page is
accessed. The <IMG> tag provides the location of the image,
as well as its height and width. The browser uses the height
and width information to place a properly-sized bounding
box upon encountering the IMG tag and continues laying out
the document text, with no performance delay to discover
the width and height of the image over the network.

Also included in the <IMG> tag is the ISMAP specifica-
tion. This attribute indicates that the image is active and
clicking inside the image may cause hyperlinks to be
accessed. The ISMAP attribute requires that a map file for the
image be included. When a user clicks on the image, the
coordinates of the click are passed to a gateway program,
imagemap, and the appropriate map file is accessed to
determine the hyperlink. FIG. 8 shows the map file for page
6 of the prototype document. Clicking on the appropriate
page image within the rectangle (89, 258, 199, 269), which
corresponds to the reverse video area containing the words
"(U.S.P.S. Zip Codes)," accesses page 18 of the prototype.
Clicking anywhere else within the image accesses the no
operation shell, no_op.sh, and no change will be made.
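
The map-file lookup can be illustrated with a simple rectangle hit test. This is a sketch under stated assumptions rather than the gateway program itself: the rectangle is read as (left, top, right, bottom) in image pixels, and the target names "page18.html" and "no-op" are hypothetical placeholders.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # assumed order: (left, top, right, bottom)


def resolve_click(x: int, y: int, regions: List[Tuple[Rect, str]], default: str) -> str:
    """Return the hyperlink target for a click, or the default if no region matches."""
    for (left, top, right, bottom), target in regions:
        if left <= x <= right and top <= y <= bottom:
            return target
    return default


# One active region, using the rectangle quoted in the text.
regions = [((89, 258, 199, 269), "page18.html")]
print(resolve_click(100, 260, regions, "no-op"))  # page18.html
print(resolve_click(10, 10, regions, "no-op"))    # no-op
```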

The images of each of the 25 pages in the prototype
document are stored in the Graphics Interchange Format
(GIF). Although some browsers support multiple
image formats, there are three formats that are always
viewable: GIF images, X-Bitmaps and X-Pixelmaps. Both
the X-Bitmap and X-Pixelmap formats store the image data
as ASCII text formatted as Standard C character string
arrays, and, as such, are an inefficient way of storing large
images. Thus, GIF is currently the most common image
format in World Wide Web applications.

The GIF format can store black-and-white, grayscale or 
color images, with a limit of 256 colors per image. The 
image data in GIF format is always compressed using the 
Lempel-Ziv-Welch (LZW) compression scheme. Thus, the
images are stored in compressed format, and algorithms read
the compressed GIF files without an intermediate step of
having to uncompress the entire image.

Method for Creating Image Pages

To create the binary images of document pages, each page 
is scanned. In this example data scanning was done at 300 
dots-per-inch, although higher or lower resolutions are pos- 
sible. Since the sample document pages contain large white 
borders that do not contribute any information, the images 
were cropped to remove the borders and then rescaled to
8.0x10.35 inches at 300 dpi resolution. The resulting sample
images are approximately 910 KBytes in size, which would
be much too large to send across a telephone network in a
timely manner (it would be acceptable over T1 networks or
better and as stored data on a CDROM). 
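
The quoted image size follows from the page dimensions. The sketch below assumes the dimensions are 8.0 by 10.35 inches and that the binary image is stored uncompressed at one bit per pixel; neither the function name nor the rounding to whole pixels comes from the text.

```python
def bitmap_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int = 1) -> float:
    """Uncompressed size in bytes of a scanned page image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel / 8


size = bitmap_bytes(8.0, 10.35, 300)  # 2400 x 3105 pixels at 1 bit per pixel
print(size)                 # 931500.0
print(round(size / 1024))   # 910
```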

Since most monitors can only display 72 dpi, it is best if 
only 72 dpi images are sent across the network. Otherwise, 
data that cannot be displayed is being wastefully sent across 
the network. A downsampling routine was needed to rescale 
the images from 300 dpi down to 72 dpi in a manner that 
preserves the strokes of the characters in the image. The 
JBIG compression method, described in detail in a subse-
quent section, with its progressive coding and sophisticated
resolution reduction algorithm, was used to optimally down-
sample the images to 72 dpi. Finally, these 72 dpi JBIG
images were converted to GIF format for a final file size of
about 20 KBytes.

Hyperlinks

To establish hyperlinks within the document images, the 
HTML attribute ISMAP is used. This feature allows images
to be made fully active. When the user clicks inside the
image, the coordinates of the click are sent to an image map
program. An image map database file corresponding to the
image relates the region selected to a specific hyperlink. The
hyperlink is then accessed.

Since the text images are stored in binary format, words
or phrases which have hyperlinks are converted to reverse
video, as shown in FIG. 9, to make them distinguishable to
the user. If the user clicks anywhere within the reverse video
area, the hyperlink will be activated. Clicks outside the
reverse video areas will cause no action.

Method for Text Searching within Documents

A search strategy may be implemented in the method as 
follows. When a search string is entered, the uncorrected 
OCR of the document pages is searched. The entire line of 
text containing the matching string is retrieved. Next, an
HTML page is formatted to contain the page on which the
string matched, as well as the full text line containing the
string. Additionally, hyperlinks are established that allow the
user to click on the page number and have the document
page retrieved. FIG. 10 shows the shell program that is used
to perform the search and display the results. 

In the sample implementation, each document page was 
scanned, and OCRed using the OmniPage Professional soft-
ware package. No attempt was made to correct the OCR and
just the text recognition for each page was retained. No
graphics were saved. These text files served as the database 
for the search and retrieval algorithms. 

The UNIX "grep" command was used to implement the 
search routine for the demonstration. "Grep" searches files 
for a pattern and prints all lines that contain that pattern. The 
results of the "grep" were sent to a formatting program 
which creates a hypertext page with the results. FIG. 11
shows the results of a search for the string "vector" in the
prototype document. Clicking on either the highlighted
"page 13" or "page 16" causes the respective pages to be
retrieved for viewing.
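
The search-and-format step can be sketched as follows. This is not the shell program of FIG. 10: it substitutes a plain substring match for "grep", and the sample page text, function names, and file-naming scheme are hypothetical.

```python
from html import escape
from typing import Dict, List, Tuple


def search_pages(pages: Dict[int, str], pattern: str) -> List[Tuple[int, str]]:
    """Return (page number, line) for every line containing pattern, as grep would."""
    hits = []
    for number in sorted(pages):
        for line in pages[number].splitlines():
            if pattern in line:
                hits.append((number, line.strip()))
    return hits


def results_html(pattern: str, hits: List[Tuple[int, str]]) -> str:
    """Format the hits as a hypertext page with a link to each matching page."""
    items = [
        f'<LI><A HREF="page{number:02d}.html">page {number}</A>: {escape(line)}</LI>'
        for number, line in hits
    ]
    return (f"<HTML><HEAD><TITLE>Results for {escape(pattern)}</TITLE></HEAD>"
            "<BODY><UL>\n" + "\n".join(items) + "\n</UL></BODY></HTML>")


# Hypothetical uncorrected OCR text; "vectcr" stands in for a recognition error.
pages = {
    13: "the feature vectcr is\nnormalized vector length",
    16: "each vector is stored",
}
hits = search_pages(pages, "vector")
print(hits)
print(results_html("vector", hits))
```

The misrecognized line on page 13 is missed, but the page is still retrieved through its second occurrence of the word, which is the redundancy that a search over uncorrected text relies upon.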

Data Compression and Display

The tradeoff between display quality and document image 
size is a major design consideration for image-based HTML 
documents. Documents stored as 8-bit grayscale images
provide adequate image quality when displayed on a stan- 
dard computer screen, but take too long to transmit across 
Internet connections. Documents stored as compressed 1-bit 
images can be easily transmitted across the Internet, but lack 
sufficient display quality. 

The remainder of this section describes the issues asso-
ciated with this tradeoff, and proposes a quality solution to
this problem.

Issues

In operation, it was observed that the GIF images found 
in the conventional HTML standard were not ideally suited 
for document images. Document images compressed under 
this format were difficult to read unless stored as large
grayscale images, and transmitting such images across the
network is very time consuming. 

There are three factors involved in determining document 
image size: dots per inch (dpi), bits per pixel (bpp), and 
compression technique. The challenge is to trade off these
factors to obtain minimal storage and sufficient quality. 

Added to this problem is the factor that many computer 
monitors are low resolution, and thus require low-dpi display
of document images. If document images are stored at 300 
dpi and are displayed on a 72 dpi monitor, much of the 
document data is discarded when the 300 dpi image is 
mapped to the 72 dpi display. FIG. 12 shows a portion of a 
300 dpi, 1 bpp image. The characters seen in this figure are
crisp and legible. 

When the image is downsampled to 72 dpi, much of the
image quality is lost, as illustrated in FIG. 13. As seen in this
figure, this downsampling can produce such effects as bro-
ken characters, touching characters, eliminated character
features, and merged characters. In some cases, this process
can create ambiguous character representations.

When such downsampled images are displayed on a 
computer screen the human eye tends to overlook many of 
these deficiencies. However, not all of these problems are 
overlooked, and the resulting display appears to be less than 
optimal as illustrated in FIG. 14. The appearance of such low
resolution images can be somewhat improved by storing 
more than 1 bpp, since shades of gray can be used to ease 
the jagged character features. However, this comes at a high 
data storage and transmission cost. 

The third factor, data compression, plays an equally 
important role in this problem. Lossless compression tech- 
niques assure that no important data will be lost, but do not 
significantly reduce the amount of data required to store 
document images. Lossy compression techniques are more 
aggressive in reducing the amount of storage required, but 
can eliminate important data. 

JBIG Standard

One compression standard deals with many of the aspects 
associated with this issue. The International Organization for 
Standardization/International Electrotechnical Commission
(ISO/IEC), in collaboration with the International Telegraph
and Telephone Consultative Committee (CCITT), defined an
image compression standard for lossless image coding of
bi-level images (ISO/IEC 11544:1993). The JBIG (Joint
Bi-level Image Experts Group) standard defines a method of 
compressing two-tone or black/white images in a bit- 
preserving manner, wherein decoded images are digitally 
identical to the originally encoded image. 

The JBIG standard can be parameterized for progressive 
coding. Thus, it is possible to transmit a low resolution 
image first, followed by resolution enhancement data. When 
decoding an image that has been progressively encoded, a 
low-resolution rendition of the original is made available 
first with subsequent doublings of resolution as more data is 
decoded. The progressive encoding mode utilizes a very 
sophisticated resolution reduction algorithm, PRES 
(progressive reduction standard), which offers the highest-
quality low resolution versions.

The progressive coding feature of JBIG is advantageous 
when an image is used by output devices with widely 
differing resolution capabilities. For example, when an 
image is displayed on a low resolution monitor (72 dpi), 
only that information in the compressed image required for 
reconstruction to the resolution of the display is transmitted 
and decoded. Then, if a higher resolution is needed for, say,
printing to a 300 dpi printer, additional compressed data is 
transmitted and built upon the already transmitted data to 
obtain the higher resolution image for the printer. 

Progressive coding is a way to send an image gradually to 
a receiver instead of all at once. During sending, more and 
more detail is sent and the receiver can build the image from 
low to high detail. JBIG uses discrete steps of detail by 
successively doubling the resolution. The sender computes a 
number of resolution layers for the image, d, and transmits 
these starting at the lowest resolution, dl. Resolution reduc- 
tion uses pixels in the high resolution layer and some already 
computed low resolution pixels as an index into a lookup 
table. The contents of this table can be specified by the user. 
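
The successive doubling of resolution can be illustrated by computing the pixel dimensions of each layer. This sketch covers only the layer geometry, not the lookup-table reduction itself; the function name is hypothetical, and rounding odd dimensions up when halving is an assumption made for the illustration.

```python
from typing import List, Tuple


def layer_sizes(width: int, height: int, layers: int) -> List[Tuple[int, int]]:
    """Pixel dimensions of each resolution layer, lowest resolution first.

    Each lower layer has half the resolution of the layer above it.
    Odd dimensions are rounded up so that every pixel is still covered.
    """
    sizes = [(width, height)]
    for _ in range(layers - 1):
        w, h = sizes[-1]
        sizes.append(((w + 1) // 2, (h + 1) // 2))
    return list(reversed(sizes))


# A 2400 x 3105 pixel page decomposed into three layers.
print(layer_sizes(2400, 3105, 3))
# [(600, 777), (1200, 1553), (2400, 3105)]
```

A receiver that needs only a screen rendition can stop after the first layer, while a printer continues through the remaining layers.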

Compatibility between progressive and sequential coding 
is achieved by dividing an image into stripes. Each stripe is 
a horizontal bar with a user definable height. Each stripe is 
separately coded and transmitted, and the user can define in 
which order stripes, resolutions and bit planes (if more than 
one) are intermixed in the coded data. A progressive coded 
image can be decoded sequentially by decoding each stripe,
beginning with the one at the top of the image, to its full
resolution, and then proceeding to the next stripe. Progres- 
sive decoding can be done by decoding only a specific 
resolution layer. 

FIG. 15 shows an image decomposed into three stripes, s,
and three resolution layers, d. Each stripe s at each resolution
d is coded into a subfile Cs,d. The JBIG file to describe the
total image is a concatenation of header information and the
Cs,d subfiles. Four ways of concatenating the stripe codings
are defined in Table 1. Decoders work naturally from low
resolution up, and so prefer the first two orderings of the
table.
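
The four orderings of Table 1 can be generated from two flags. The sketch below assumes that the subfiles are numbered layer by layer from the lowest resolution, so that stripe s of layer d has index d x (number of stripes) + s; the function and parameter names are hypothetical.

```python
from typing import List


def subfile_order(stripes: int, layers: int, hi_to_low: bool, sequential: bool) -> List[int]:
    """Order in which the stripe/layer subfiles are concatenated."""
    def index(stripe: int, layer: int) -> int:
        # Assumed numbering: layer 0 is the lowest resolution.
        return layer * stripes + stripe

    layer_order = range(layers - 1, -1, -1) if hi_to_low else range(layers)
    if sequential:
        # All layers of one stripe before moving to the next stripe.
        return [index(s, d) for s in range(stripes) for d in layer_order]
    # All stripes of one layer before moving to the next layer.
    return [index(s, d) for d in layer_order for s in range(stripes)]


for hi_to_low in (False, True):
    for sequential in (False, True):
        print(int(hi_to_low), int(sequential),
              subfile_order(3, 3, hi_to_low, sequential))
```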

TABLE 1

POSSIBLE DATA ORDERINGS

Hi to Low   Seq.   Order
0           0      0, 1, 2   3, 4, 5   6, 7, 8
0           1      0, 3, 6   1, 4, 7   2, 5, 8
1           0      6, 7, 8   3, 4, 5   0, 1, 2
1           1      6, 3, 0   7, 4, 1   8, 5, 2


After dividing an image into bit planes, resolution layers 
and stripes, eventually a number of small bi-level bitmaps
are left to compress. Compression is done using a Q-coder
which codes bi-level pixels as symbols using the probability
of occurrence of these symbols in a certain context. JBIG 
defines two kinds of context, one for the lowest resolution 
layer (the base layer), and one for all other layers 
(differential layers). Differential layer contexts contain pix- 
els in the layer to be coded, and in the corresponding lower 
resolution layer. 
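
The role of the context can be illustrated with an adaptive probability model. The sketch below is not a Q-coder and does not use the JBIG context templates: it assumes a hypothetical context made of the two preceding pixels in a row, and it only sums the ideal code length that an arithmetic coder driven by such a model would approach.

```python
import math
from collections import defaultdict
from typing import List


def ideal_code_length(bits: List[int], context_size: int = 2) -> float:
    """Ideal code length, in bits, of a 0/1 sequence under an adaptive
    probability estimate conditioned on the preceding pixels."""
    # One pair of counts [zeros seen, ones seen] per context, starting at 1 each.
    counts = defaultdict(lambda: [1, 1])
    total = 0.0
    for i, bit in enumerate(bits):
        context = tuple(bits[max(0, i - context_size):i])
        zeros, ones = counts[context]
        probability = (ones if bit else zeros) / (zeros + ones)
        total += -math.log2(probability)  # likely symbols cost few bits
        counts[context][bit] += 1         # adapt to the symbol just seen
    return total


# A mostly white row: 60 white pixels (0) followed by 4 black pixels (1).
row = [0] * 60 + [1] * 4
print(ideal_code_length(row) < len(row))  # True
```

In the all-white context the estimated probability of white quickly approaches one, so each further white pixel costs a small fraction of a bit.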

The probability distribution of white and black pixels can
be different for each combination of pixel values in a
context. In an all white context, the probability of coding a
white pixel will be much greater than that of coding a black
pixel. The Q-coder assigns, just like a Huffman coder, more
bits to less probable symbols, and so achieves compression.
The Q-coder can, unlike a Huffman coder, assign one output
code bit to more than one input symbol, and thus is able to
compress bi-level pixels without explicit clustering, as
would be necessary using a Huffman coder. Maximum
compression is achieved when all probabilities (one set for
each combination of pixel values in the context) follow the
probabilities of the pixels. The Q-coder therefore continu-
ously adapts these probabilities to the symbols it sees.

Since JBIG supports multiple bit planes, it is possible to
effectively use the JBIG standard for the lossless coding of
grayscale and color images as well. Images with eight or
fewer bits/pixel compress well with the JBIG method; with
more than eight bits/pixel other compression algorithms are
more effective. In a study of international standards for
lossless still image compression, JBIG compression was up
to 29 percent superior to lossless JPEG compression for
images with up to six bits/pixel. It was also found that JBIG
had a 1.1 to 1.5 times better compression ratio on typical
scanned documents, compared to G4 fax compression which
had been the best compression algorithm for scanned docu-
ments available prior to JBIG.

Cost Analysis vs. Existing Document Coding Methods

Existing options for document conversion can be catego-
rized into four classes: OCR, Image Database with
Keywords, Image Database with OCR, and Structured
Image Database with OCR. Each of these approaches to
document conversion has its advantages and disadvantages.
A summary of these is provided in Table 2.

The OCR approach is the most labor expensive. It
requires complete conversion to electronic format. As indi-
cated by our experiments, current OCR technology has only
achieved a level to where it is cost competitive with manual
reentry. This is very costly, but also produces the best final
product. It uses the least amount of disk space, supports the
best retrieval, and has optimal display quality. It just costs a
lot to convert documents to this format.

The Image Database with Keywords approach is the next
most labor expensive. This approach utilizes document
images and document keywords. The keywords are manu-
ally obtained, and this requires extensive labor. Furthermore,
the quality of the keywords is highly dependent on the
expertise of the individual selecting the keywords. As such,
the selection of keywords requires significant labor from a
highly skilled individual. Both document images and ASCII
keywords are stored under this approach, and thus, this
method requires significantly more storage space than the
OCR approach. Document retrieval is limited to the key-
words associated with each document, and thus is only as
comprehensive as the keywords. Display quality is limited
to image quality. The conversion accuracy is optimal, since
the content of the original document is preserved in the
document image.

TABLE 2



The Image Database with OCR approach is the least labor
expensive. This approach utilizes document images and
uncorrected (or dirty) OCR. Since no corrections are made
to the OCR, this approach minimizes conversion labor costs.
Both document images and ASCII text are stored under this
approach, and thus, this method requires more storage space
than the Image Database with Keywords approach. The
uncorrected OCR is used to retrieve images. This approach 
relies on the fact that the English language has many 
redundancies, and thus, a few OCR mistakes can be over- 
looked in the retrieval process. Display quality is limited to 
image quality. The conversion accuracy is optimal since the 
content of the original document is preserved in the docu- 
ment image. 

The Structured Image Database with OCR approach
requires slightly more labor cost than the inexpensive Image
Database with OCR approach. This approach utilizes docu-
ment images, document structure knowledge, and uncor-
rected OCR. Both document images and ASCII text are
stored under this approach, and thus, this method requires
storage space that is similar to that of the Image Database
with OCR approach. The uncorrected OCR is used to
retrieve images, and again relies on English language redun-
dancies for retrieval accuracy. Document structure can also
be used in the retrieval process to focus query retrievals.
Display quality is limited to image quality. The conversion
accuracy is optimal, since the content of the original docu-
ment is preserved in the document image.

Labor cost is the major price factor involved in each of 
these conversion approaches. For any large scale conversion 
task the labor costs associated with the OCR approach are 
simply prohibitive. Similarly, the labor cost associated with 
the Image Database with Keywords approach is also very 
expensive. Although not as extreme as the OCR labor costs, 
the selection of keywords involves manually categorizing 
the document contents. This requires a high degree of skill, 
and thus, is not inexpensive. The labor costs associated with
the Image Database with OCR approach are minimal. Docu-
ments are simply scanned to create the document image
database and OCRed to create the corresponding uncor-
rected ASCII text. Both can be highly automated. The labor
costs associated with the Structured Image Database with
OCR approach are slightly higher, since it involves manually
monitoring and correcting the document parsing process.
Simple tools can be developed to minimize the labor 
required for this process. 

This invention employs the Structured Image Database 
with OCR approach. The prohibitive labor costs associated
with the OCR and Image Database with Keyword 
approaches eliminated these approaches from further con- 
sideration. The costs associated with the Image Database 
with OCR and the Structured Image Database approaches
were most comparable. The Structured Image Database with 
OCR approach was selected over the Image DB with OCR 
approach, however, because the structure knowledge was 
viewed as necessary to support simple network interfaces, 
valuable for retrieval performance, and relatively inexpen- 
sive. 

The approach utilizes both document images and OCRed 
text. It is robust in that only document images are viewed 
(the quality of these images can be improved through the
development of text-specific image downsampling routines).
Furthermore, the document is fully hypertext linked, and, as
such, it is easy to navigate. Additionally, the search using the
"dirty" OCRed text provides significant flexibility for find- 
ing information within the document (which can be further 
improved by registering the dirty ASCII text to image 
zones). 

An efficient encoding method, including means for docu-
ment searching, hyper-link indexing, and HTML coding has
been described which provides a cost-effective means for
many organizations to make data, which currently exists in
only paper form, available to the rapidly growing number of
computer and Internet users. Because the conversion method
enables direct translation into HTML format, it can be used
to create electronic documents that can be viewed by current
Internet browsers like Mosaic or Netscape as well as disk
and CDROM viewers (conventional word processor pro-
grams and edit/viewer utilities).
That claimed is: 

1. A method of automatically coding, managing, and 
displaying a document in digital form, the method compris- 
ing the steps of: 

scanning a document into an image format suitable for
display purposes;

embedding the image format into a hypertext-based meta-
language format including one or more hypertext links;

segmenting the hypertext-based document into one or
more structured blocks;

decoding a particular block into text, images, and tables,
as appropriate, in accordance with a block-specific
decoding strategy; and

embedding the text derived from the block decoding into
a conventional document format, enabling the use of a
text-based search method.

2. The method of claim 1, wherein GIF is used as the 
image format. 

3. The method of claim 1, wherein JBIG is used as the
image format. 

4. The method of claim 1, wherein HTML is used as the
meta-language format.

5. The method of claim 1, wherein SGML is used as the 
meta-language format. 

6. The method of claim 1, wherein the block-specific
decoding strategy includes optical character recognition.

7. The method of claim 1, wherein the text-based search 
method is based on Boolean keyword expression matching. 

8. The method of claim 1, including the step of visually 
indicating the hypertext links. 

9. The method of claim 1, wherein the hypertext links are
HTML-compatible URLs.

10. The method of claim 1, wherein the meta-language is 
primarily text-based, and wherein the images are in a native 
operating system format. 

11. The method of claim 1, wherein the documents are 
stored on an Internet server enabling remote browser access. 

12. The method of claim 1, wherein the documents are 
locally stored on a disk-based medium. 

13. A method of digital document encoding and 
management, comprising the steps of: 

scanning a document into one or more page images; 

embedding at least one of the page images into a 
hypertext-based meta-language format which enables a 
user to automatically or manually segment the page 
images into document structure blocks; 

decoding a particular block into text plus images and
tables, as appropriate, in accordance with a block-specific
decoding strategy, the text including one or
more non-proofread (dirty) sections; and

embedding the text, including the dirty sections, into a 
conventional document format, enabling the use of a 
text-based search method. 

14. The method of claim 13, further including the step of 
transmitting the scanned image, including dirty text sections, 
for display purposes. 

15. The method of claim 13, wherein the step of embedding
at least one of the page images into a hypertext-based
meta-language format further supports hypertext linkage
through coding or attaching of hypertext links to subimage
locations over word subimages within the full page image.

16. The method of claim 13, wherein the meta-language
is a text-based format, and the images are in a native
operating system format selected from among the following:

PICT (Macintosh),
BMP (Windows),
TIF (Generic),
MacDraw (Macintosh),
X11 (Unix),
COM (Generic),
PostScript (Generic for printer output),
GIF (WWW/HTML), and
JPEG (WWW/HTML).

17. The method of claim 13, wherein the documents are
stored on a World Wide Web server for remote HTML
browser access.

18. The method of claim 13, wherein the documents are 
stored on removable, nonvolatile medium for local computer 
use.

(57) ABSTRACT

The text to be classified is compared with the contents of a
relevance lexicon in which the significant words of the texts
to be classified per text class and their relevance to the text
classes are stored. A fuzzy set is calculated which specifies,
for the significant words of the text to be classified, their
occurrence per text class and their relevance to the text class.
A probability calculation is used for determining the
probability with which the fuzzy set occurs for the
corresponding class. The class having the highest probability
is selected and the text is allocated to this class.

METHOD OF AUTOMATICALLY
CLASSIFYING A TEXT APPEARING IN A
DOCUMENT WHEN SAID TEXT HAS BEEN
CONVERTED INTO DIGITAL DATA

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates generally to a method for 
classifying text by significant words in the text. 

2. Description of the Related Art 

From the reference A. Dengel et al., 'Office Maid - A
System for Office Mail Analysis, Interpretation and
Delivery', Int. Workshop on Document Analysis Systems, a
system is known by means of which, for example, business
letter documents can be categorized and can then be
forwarded, or stored selectively, in electronic form or paper
form. For this purpose, the system contains a unit for
segmenting the layout of the document, a unit for optical text
recognition, a unit for address detection and a unit for
contents analysis and categorization. For the segmentation
of the document, a mixed bottom-up and top-down approach
is used, the individual steps of which are:
Recognition of the contiguous components, 
Recognition of the text lines, 
Recognition of the letter segments, 
Recognition of the word segments, and 
Recognition of the paragraph segments. 

The optical text recognition is divided into three parts:
Letter recognition in combination with lexicon-based word
verification,
Word recognition, with the classification from letters and
word-based recognition.

The address recognition is performed by means of a
unification-based parser which operates with an attributed
context-free grammar for addresses. Accordingly, text parts
correctly parsed in the sense of the address grammar are the
addresses. The contents of the addresses are determined via
character equations of the grammar. The method is described
in the reference M. Malburg and A. Dengel, 'Address
Verification in Structured Documents for Automatic Mail
Delivery'.

Information retrieval techniques for the automatic index- 
ing of texts are used for the contents analysis and catego- 
rization. In detail, this takes place as follows: 
Morphological analysis of the words 
Elimination of stop words 
Generation of word statistics 

Calculation of the index term weight by means of formulas 
known from information retrieval such as, for example, 
inverse document frequency. 
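The index term weighting named in the last step can be sketched as follows. This is a minimal illustration only: the cited system's exact formula is not given here, so the common variant log(N / df) is an assumption, as are the function name `idf_weights` and the sample documents.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Return {term: idf} for documents given as lists of index terms.

    Assumed variant: idf(t) = log(N / df(t)), where N is the number of
    documents and df(t) the number of documents containing term t.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical, already stemmed and stop-word-free business letters.
docs = [
    ["invoice", "payment", "order"],
    ["order", "delivery"],
    ["payment", "reminder", "order"],
]
weights = idf_weights(docs)
```

A term that occurs in every document ("order") receives weight 0 and therefore cannot characterize any category, whereas a term seen in a single document receives the largest weight.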

The index term weights calculated in this manner are then
used for determining for all categories a three-level list of
significant words which characterizes the respective
category. As described in the reference A. Dengel et al., 'Office
Maid - A System for Office Mail Analysis, Interpretation
and Delivery', Int. Workshop on Document Analysis
Systems, these lists are then manually revised after the
training phase.

A new business letter is then categorized by comparing
the index terms of this letter with the lists of the significant
words for all categories. The weights of the index terms
contained in the letter are multiplied by a constant depending
on significance and are added together. Dividing this
sum by the number of index terms in the letter then results
in a probability for each class. The detailed calculations are
found in the reference R. Hoch, 'Using IR Techniques for
Text Classification in Document Analysis'. The result of the
contents analysis is then a list of hypotheses sorted according
to probabilities.

SUMMARY OF THE INVENTION

The object forming the basis of the present invention 
consists in providing a method according to which the 
contents analysis of the text and thus the text classification 
is improved. In this connection, it is assumed that the text of 
the document is already available as digital data which are 
then processed further. 

This object is achieved in accordance with the method for
the automatic classification of a text applied to a document
after the text has been transformed into digital data with the
aid of a computer, in which each text class is defined by
significant words, the significant words and their significance
to the text class are stored in a lexicon file for each text
class, a text to be allocated is compared with all text classes
and, for each text class, the fuzzy set of words in text and
text class and its significance to the text class is determined,
the probability of the allocation of the text to the text class
is determined from the fuzzy set of each text class and its
significance to each text class, and the text class with the
highest probability is selected and the text is allocated to this
class.

Further developments of the invention are provided by 
further steps, wherein the text to be classified is morpho- 
logically analyzed in a morphological analyzer preceding 
the contents analysis, the morphologically analyzed text is 
supplied to a stochastic tagger in order to resolve lexical 
ambiguities, and the tagged text is used for text classifica- 
tion. Preferably, a relevance lexicon is generated for the 
classification of the text; for this purpose, a set of training 
texts is used, the classes of which are known; the frequencies 
of the classes, of words and of words in the respective 
classes are counted from this set; an empirical correlation 
between a word and class is calculated by means of these 
frequencies; this correlation is calculated for all words and 
all classes and the result of the calculation is stored in a file
as a relevance of a word to a class, which file is used as a 
relevance file or a relevance lexicon. 

In one embodiment, the correlation (or relevance)
between a word and a class is established in accordance with
the following formula:

$$\mathrm{rlv}(w \text{ in } c) := r(w, c) = \frac{N \cdot \Sigma_{wc} - \Sigma_w \cdot \Sigma_c}{\sqrt{\left(N \cdot \Sigma_w - \Sigma_w^2\right)\left(N \cdot \Sigma_c - \Sigma_c^2\right)}}$$

where:

N = number of training texts,
Σ_wc = number of training texts of class c with word w,
Σ_w = number of training texts with word w,
Σ_c = number of training texts of class c.
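Treating "text contains word w" and "text belongs to class c" as 0/1 indicators per training text, Pearson's correlation coefficient can be computed directly from the four counts defined above. The following is a minimal sketch under that assumption; the function name `relevance`, the handling of a zero denominator and the example counts are illustrative and not taken from the patent.

```python
import math


def relevance(n, n_wc, n_w, n_c):
    """Pearson correlation between word occurrence and class membership.

    n    -- number of training texts (N)
    n_wc -- number of training texts of class c containing word w
    n_w  -- number of training texts containing word w
    n_c  -- number of training texts of class c
    """
    numerator = n * n_wc - n_w * n_c
    denominator = math.sqrt((n * n_w - n_w ** 2) * (n * n_c - n_c ** 2))
    if denominator == 0:
        # Word or class occurs in no text or in every text:
        # the correlation is undefined, so 0.0 is returned (assumption).
        return 0.0
    return numerator / denominator


# Hypothetical counts: 100 texts, 20 of class c, 10 containing w,
# 8 of which belong to class c.
r = relevance(100, 8, 10, 20)
```

With these counts the numerator is 100*8 - 10*20 = 600 and the denominator is sqrt(900 * 1600) = 1200, so the relevance is 0.5. If the word were distributed independently of the class (2 of the 10 texts in class c), the numerator and hence the relevance would be 0.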

One embodiment provides that only correlations greater 
than a selected value r-max are taken into consideration, 
which value is established at a significance level of e.g. 
0.001. In such embodiment, the text to be examined and 
relevance lexicon are used for determining for each class the 
fuzzy set of significant words per class and its relevance per 
class, from the fuzzy set per class and its relevance to each 
class, the probability of its fuzzy set of relevant words is 
calculated, and the class with the maximum probability is
determined from the probabilities per class and the text is 
allocated to this class. 

In this example, the probability is calculated in accordance
with the formula

$$\mathrm{prob}(A) := \sum_x \mu_A(x) \cdot p(x)$$

US 6,295,543 B1



where μ_A is the membership function which specifies the
extent to which the fuzzy set is allocated to a class, and 
which just corresponds to the correlation measure according 
to the above formula. 

The present method may be used for automatic diagnosis 
from medical findings, in which the medical findings are 
considered to be the text and an illness is considered to be 
a class, in which method in a training phase the knowledge 
required for the classification is automatically learned from 
a set of findings the diagnosis of which is known, and a new 
finding is classified in accordance with the technique of 
fuzzy sets. 

A case of application of the method is the automatic
diagnosis from medical findings. If a medical finding is
considered to be a text and an illness is considered to be a
class, the problem of automatic diagnosis can be solved by
means of the method of text classification. It is a considerable
advantage of the method that it learns the knowledge
needed for the classification automatically and unsupervised
from a set of findings the diagnosis of which is known. There
is no additional effort required by the doctor, who only needs
to write down the finding as usual. The learning takes place
from the findings already in existence. After the training
phase, a finding is then classified with the aid of the learned
knowledge source and techniques of fuzzy sets. The class
allocated to the findings corresponds to the illness
diagnosed.

It is initially assumed that the text to be examined is
already available in the form of ASCII data.

Preceding the contents analysis of a text, a morphological
analysis is performed which morphologically analyzes (i.e.
reduces to their stem forms) all words in the first step and
then resolves lexical ambiguities by means of a stochastic
tagger. A method according to the publication from "LDV-Forum"
can be used for the morphological analysis. A
description of the tagger used can be found in the publication
by E. Charniak, "Statistical Language Learning". The
tagged text is always the starting point for all further
processing steps.

The text classification is training-based. From a set of
training texts, the classes of which are known, the frequencies
of classes, of words overall and of words in the
respective classes are counted. These frequencies are then
used for calculating the empirical correlation between a
word and a class according to Pearson (H. Weber,
'Einführung in die Wahrscheinlichkeitsrechnung und Statistik für
Ingenieure' (Introduction to probability calculation and
statistics for engineers), pp. 193-194). This correlation is
calculated for all words and all classes and is regarded as
the relevance of a word to a class.

Only correlations greater than a value r-max, which is
obtained from checking the independence at a significance
level of e.g. 0.001, are taken into consideration (see also, for
example, H. Weber, 'Einführung in die
Wahrscheinlichkeitsrechnung und Statistik für Ingenieure' (Introduction to
probability calculation and statistics for engineers), p. 244).
The result obtained is a lexicon which contains the
relevances of the words to the classes.

After a text has been morphologically analyzed, it is
classified with the aid of this relevance lexicon, as follows:
for each class, a fuzzy set is determined which contains all
relevant words. The membership function μ_A of the fuzzy
set exactly corresponds to Pearson's correlation measure. To
obtain the most probable class, the probability of its fuzzy
set of relevant words is calculated for each class. This
purpose is served by the formula from H. Bandemer and S.
Gottwald, 'Einführung in Fuzzy-Methoden' (Introduction
to fuzzy methods), normally used in fuzzy theory, namely:

$$\mathrm{prob}(A) := \sum_x \mu_A(x) \cdot p(x)$$

where μ_A is the membership function of fuzzy set A of
relevant words of a class and p(x) is interpreted as p(x is
relevant to A):

$$p(x \text{ is relevant to } A) := p(A \mid x) = \frac{p(x, A)}{p(x)}$$

As a result of the classification, the class with the most
probable fuzzy set is output.
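The text does not state how p(x, A) and p(x) are obtained. One plausible reading, shown here purely as an assumption, is to estimate both by relative frequencies in the training set, in which case the number of training texts cancels and p(A|x) reduces to the share of texts containing x that belong to class A. The function name `conditional_relevance` and the counts are hypothetical.

```python
from fractions import Fraction


def conditional_relevance(n, n_wc, n_w):
    """Estimate p(A | x) = p(x, A) / p(x) from training counts.

    n    -- number of training texts
    n_wc -- training texts of class A that contain word x
    n_w  -- training texts that contain word x
    """
    p_joint = Fraction(n_wc, n)  # relative-frequency estimate of p(x, A)
    p_word = Fraction(n_w, n)    # relative-frequency estimate of p(x)
    if p_word == 0:
        return Fraction(0)       # unseen word: no evidence for any class
    return p_joint / p_word


# Hypothetical counts: word x occurs in 10 of 100 texts, 8 of them in class A.
p = conditional_relevance(100, 8, 10)
```

Exact fractions are used so that the cancellation of the number of training texts is visible: the result here is 8/10 regardless of the total of 100.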

BRIEF DESCRIPTION OF THE DRAWINGS 

The invention will be explained in greater detail with 
reference to an illustrative embodiment. 

FIG. 1 is a block diagram which shows a basic representation
of the method;

FIG. 2 is a block diagram which shows the sequence of
preparation of the text;

FIG. 3 is a block diagram which shows a method for
training the system;

FIG. 4 is a block diagram which shows the method for
classifying the text.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

FIG. 1 shows a basic representation of the method. It is
intended to classify the text on a paper document DOK.
Firstly, the document DOK is scanned with the aid of a
scanner SC and an image file BD is generated. The text to
be classified is segmented in a layout segmentation SG and
the text segment TXT-SG is formed with the aid of the method
known from European Patent Application 0 515 714 A1. This,
in turn, provides an image file which now only contains the
text part of the document. The image data of this text are
then converted into ASCII data by means of an OCR (optical
character recognition). These data are designated by TXT in
FIG. 1. Using a training lexicon REL-LEX, the text
classification TXT-K is performed and thus a class hypothesis is
generated which specifies the probability with which the text
to be classified can be allocated to a particular class. The
class hypothesis is called KL-H in FIG. 1.

Preceding the contents analysis of the text TXT which is
present in ASCII format, a morphological analysis is
performed. For this purpose, all words of the text are
morphologically analyzed, i.e. reduced to their stem forms, in the
first step (with the aid of a morphological analyzer LEM
which supplies the morphologically analyzed text L-TXT),
and then lexical ambiguities are resolved by means of a
stochastic tagger TAG. The result of this treatment of the
text TXT is the tagged text T-TXT which can then be
processed further. The operation of the morphological
analyzer LEM is described in LDV-Forum and the structure and
function of the tagger are described in E. Charniak,
"Statistical Language Learning".

The tagged text T-TXT is then the starting point for the
further processing steps.

Before the text classification can be performed, a training
phase must be provided. In this training phase, a relevance
lexicon REL-LEX is generated which will be used later for
classifying texts. For this purpose, the frequencies of
classes, of words overall and of words in the respective
classes are counted from a set of training texts TXT-TR, the
classes KL-TXT of which are known. This is done in a unit
FR for frequency calculation in which the word frequencies
FR-W and the class frequencies FR-KL are formed. Using
these frequencies, the empirical correlation between a word
and a class is calculated according to Pearson (H. Weber,
'Einführung in die Wahrscheinlichkeitsrechnung und Statistik
für Ingenieure' (Introduction to probability calculation
and statistics for engineers), pp. 193-194):

$$\mathrm{rlv}(w \text{ in } c) := r(w, c) = \frac{N \cdot \Sigma_{wc} - \Sigma_w \cdot \Sigma_c}{\sqrt{\left(N \cdot \Sigma_w - \Sigma_w^2\right)\left(N \cdot \Sigma_c - \Sigma_c^2\right)}}$$

where:

N = number of training texts,
Σ_wc = number of training texts of class c with word w,
Σ_w = number of training texts with word w,
Σ_c = number of training texts of class c.

This correlation is calculated for all words and all classes
and is regarded as relevance REL of a word to a class. In this
connection, care is taken that the correlations do not become
too small, for which reason a value r-max is introduced
which is set, for example, to a significance level 0.001, see
the reference H. Weber, 'Einführung in die
Wahrscheinlichkeitsrechnung und Statistik für Ingenieure' (Introduction to
probability calculation and statistics for engineers), p. 244.
The results, that is to say the relevances of a word to a class,
are stored in a lexicon REL-LEX which thus contains the
relevances of the words to the classes.
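The training phase described above (counting frequencies, then correlating every word with every class) could be sketched as follows. This is an illustrative reading, not the patented implementation: the function name `train_relevance_lexicon`, the data layout and the toy training texts are assumptions. Because the lexicon excerpt in the application example also lists small and negative relevances, the r-max filter is made optional here; where exactly it is applied is likewise an assumption.

```python
import math
from collections import Counter, defaultdict


def train_relevance_lexicon(training_texts, r_max=None):
    """Build {word: {class: relevance}} from (words, class_label) pairs.

    If r_max is given, only correlations greater than r_max are stored.
    """
    n = 0
    class_freq = Counter()       # training texts per class
    word_freq = Counter()        # training texts containing the word
    word_class_freq = Counter()  # training texts of the class containing the word
    for words, label in training_texts:
        n += 1
        class_freq[label] += 1
        for w in set(words):     # count each word once per text
            word_freq[w] += 1
            word_class_freq[(w, label)] += 1

    lexicon = defaultdict(dict)
    for w, n_w in word_freq.items():
        for c, n_c in class_freq.items():
            n_wc = word_class_freq[(w, c)]
            denom = math.sqrt((n * n_w - n_w ** 2) * (n * n_c - n_c ** 2))
            if denom == 0:
                continue         # correlation undefined for this pair
            r = (n * n_wc - n_w * n_c) / denom
            if r_max is None or r > r_max:
                lexicon[w][c] = r
    return dict(lexicon)


# Hypothetical, already tagged training texts with known classes.
texts = [
    ({"drucker", "papier"}, "printer"),
    ({"drucker"}, "printer"),
    ({"monitor"}, "video"),
    ({"monitor", "bild"}, "video"),
]
lexicon = train_relevance_lexicon(texts, r_max=0.1)
```

In this toy set "drucker" occurs in exactly the printer texts, so its relevance to the class printer is 1.0, while its relevance to video is -1.0 and is dropped by the filter.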

Once the relevance lexicon REL-LEX has been
generated, the text T-TXT to be examined can then be
classified. For this purpose, selected words of the text which
are of significant importance are compared with the
relationships, existing in the relevance lexicon REL-LEX,
between the words and the classes, and a
so-called fuzzy set FUZ-R is generated therefrom for the
text and for each class. These fuzzy sets per class are stored
in a file FUZ-KL. The fuzzy set per class contains the words
of the text which occur in the class and their relevance to this
class. From the fuzzy set, the probability of its fuzzy set of
relevant words is calculated for each class in a unit PROB
and stored in a file PROB-KL. For this purpose, the
membership function of the fuzzy set is determined in relation to
that class, which just corresponds to Pearson's correlation
measure. The probability is calculated in accordance with
the formula normally used in fuzzy theory; this formula
has already been specified above and is known from H.
Bandemer and S. Gottwald, 'Einführung in Fuzzy-Methoden'
(Introduction to fuzzy methods). The class for
which the highest probability has been calculated is selected
in a unit MAX for maximum calculation. The text T-TXT is
allocated to this class. This class is called TXT-KL in FIG.
4.
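The classification step (fuzzy set per class, probability per class, maximum) can be sketched as below. The function name `classify`, the dictionary layouts, the default threshold and all numbers in the usage example are hypothetical; the sketch only assumes that the probability of a fuzzy set is the sum over its words of membership value times word probability, as stated in the text.

```python
def classify(words, lexicon, word_prob, r_max=0.0):
    """Return (best_class, probabilities) for a tagged text.

    words     -- stem forms occurring in the text
    lexicon   -- {word: {class: relevance}} (relevance lexicon)
    word_prob -- {(word, class): p(word is relevant to class)}
    r_max     -- only relevances greater than this value are considered
    """
    # Fuzzy set per class: words of the text with their relevance to the class.
    fuzzy_sets = {}
    for w in set(words):
        for c, rlv in lexicon.get(w, {}).items():
            if rlv > r_max:
                fuzzy_sets.setdefault(c, {})[w] = rlv

    # Probability of each fuzzy set: sum of membership * word probability.
    probs = {
        c: sum(mu * word_prob.get((w, c), 0.0) for w, mu in members.items())
        for c, members in fuzzy_sets.items()
    }
    best = max(probs, key=probs.get) if probs else None
    return best, probs


# Hypothetical lexicon and word probabilities.
lexicon = {
    "drucker": {"printer": 0.7},
    "papier": {"printer": 0.3, "video": 0.1},
}
word_prob = {
    ("drucker", "printer"): 0.9,
    ("papier", "printer"): 0.5,
    ("papier", "video"): 0.2,
}
best, probs = classify(["drucker", "papier", "neu"], lexicon, word_prob)
```

Words that are missing from the lexicon ("neu") simply contribute to no fuzzy set, so they have no influence on the selected class.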

The method will be explained by means of the following 
application example: 

News from the USENET Newsgroup de.comp.os.linux.misc
is to be sorted into the following classes: printer,
configuration, network, sound, external memory, video,
software, development, kernel, communication, input
devices, SCSI, X-Windows and operating system. The first
processing step of a text is the morphological analysis. It
transforms, for example, the German language sentence
"Beim Starten von X kommt mit der Mirage-P32 nur ein
weißer Bildschirm" (in English: "When starting X with the
Mirage-P32, only a white screen appears") into the
morphologically analyzed form:

0 1 Beim beim prp
1 2 starten starten vfin
1 2 starten starten vinfin
2 3 von von prp
3 4 X x n
4 5 kommt kommen vfin
5 6 mit mit prp
5 6 mit mit vprt
6 7 der d pron
6 7 der der det
6 7 der der relpron
7 8 Mirage mirage n
8 9 - - -
9 10 P32 p32 n
10 11 nur nur adv
11 12 ein ein det
11 12 ein ein vprt
12 13 weisser weiss adjflk
13 14 Bildschirm bildschirm n
13 15 Bildschirm. bildschirm. $$$
13 14 Bildschirm. bildschirm. $$$
14 15 . . eos_punkt
14 15 . . punkt
15 16 SCRS SCR SCRS

The tagger resolves the ambiguities in categories and basic
forms:

0 1 Beim beim prp
1 2 starten starten vfin
2 3 von von prp
3 4 X x n
4 5 kommt kommen vfin
5 6 mit mit prp
6 7 der der det
7 8 Mirage mirage n
8 9 - - -
9 10 P32 p32 n
10 11 nur nur adv
11 12 ein ein det
12 13 weisser weiss adjflk
13 14 Bildschirm bildschirm n
14 15 . . eos_punkt
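The listings above can be read as a word lattice: each line gives a start position, an end position, the surface form, the stem form and a category, and several lines with the same positions are competing readings. The sketch below only parses such lines and reports the ambiguous positions that a tagger would have to resolve; it does not implement a stochastic tagger. The function name `read_lattice` and the interpretation of the five columns are assumptions based on the listing.

```python
from collections import defaultdict

# First four positions of the morphologically analyzed example sentence.
ANALYSIS = """\
0 1 Beim beim prp
1 2 starten starten vfin
1 2 starten starten vinfin
2 3 von von prp
"""


def read_lattice(text):
    """Group readings (surface, stem, category) by their (start, end) span."""
    spans = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip lines that do not have the five expected columns
        start, end, surface, stem, category = fields
        spans[(int(start), int(end))].append((surface, stem, category))
    return dict(spans)


lattice = read_lattice(ANALYSIS)
# Spans with more than one reading are lexically ambiguous.
ambiguous = {span: r for span, r in lattice.items() if len(r) > 1}
```

For the excerpt, only the span (1, 2) is ambiguous: "starten" may be a finite verb (vfin) or an infinitive (vinfin), and the tagged listing keeps the vfin reading.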

During the training, the following relevance lexicon was
trained (excerpt):

soundkarte_n
<konfiguration> rlv = 0.12523
<netzwerk> rlv = -0.033766
<sound> rlv = 0.716692
<externer speicher> rlv = -0.005260
monitor_n
<video> rlv = 0.606806
drucker_n
<drucker> rlv = 0.683538
<software> rlv = 0.14210
gcc_n
<entwicklung> rlv = 0.684036
<kernel> rlv = 0.103325
<kommunikation> rlv = -0.053544
apsfilter_n
<drucker> rlv = 0.561354
grafikkarte_n
<eingabegeraete> rlv = -0.008924
<konfiguration> rlv = 0.017783
<scsi> rlv = -0.005854
<video> rlv = 0.501108

<eingabegeraete> rlv = 0.023704
<x-windows> rlv = 0.580419
scsi_n
<eingabegeraete> rlv = -0.65260
<kernel> rlv = -0.026075
<konfiguration> rlv = 0.117458
<netzwerk> rlv = 0.035671
<betriebssystem> rlv = -0.063972
<scsi> rlv = 0.582414
<sound> rlv = -0.041297
<externer speicher> rlv = 0.284832
<video> rlv = -0.107000
ethernet_n
<kommunikation> rlv = -0.012769
<netzwerk> rlv = 0.502532
<betriebssystem> rlv = 0.014134
x_n
<drucker> rlv = 0.073611
<eingabegeraete> rlv = 0.005764
<entwicklung> rlv = 0.073568
<kernel> rlv = 0.005127
<kommunikation> rlv = -0.108931
<konfiguration> rlv = -0.055763
<netzwerk> rlv = -0.077721
<betriebssystem> rlv = -0.046266
<scsi> rlv = 0.054152
<sound> rlv = -0.037581
<externe speicher> rlv = -0.081716
<software> rlv = 0.037474
<video> rlv = 0.197814
<x-windows> rlv = 0.299126
mirage_n
<scsi> rlv = 0.065466
<video> rlv = 0.221600
bildschirm_n
<drucker> rlv = -0.023347
<eingabegeraete> rlv = 0.036846
<entwicklung> rlv = -0.022288
<konfiguration> rlv = -0.014254
<video> rlv = 0.216536
<x-windows> rlv = 0.269369
starten_vinfin
<kommunikation> rlv = 0.002855
<konfiguration> rlv = 0.060185
<betriebssystem> rlv = 0.006041
<externe speicher> rlv = -0.001856
<x-windows> rlv = 0.260549
starten_vfin
<drucker> rlv = -0.038927
<entwicklung> rlv = -0.037790
<kernel> rlv = 0.009309
<kommunikation> rlv = -0.057605
<konfiguration> rlv = 0.035588
<netzwerk> rlv = -0.045992
<betriebssystem> rlv = -0.003344
<sound> rlv = -0.019409
<externe speicher> rlv = -0.043312
<video> rlv = 0.110620
<x-windows> rlv = 0.178526

The following fuzzy sets are then formed for the classes:

Video = {x (0.197814), mirage (0.221600), bildschirm (0.216536)}
X-Windows = {starten (0.178526), x (0.299126), bildschirm (0.269369)}

Furthermore, the probabilities of the following words are
known:

Word          Video    X-Windows
x             0.24     0.19
mirage        0.8
bildschirm    0.43     0.33
starten       0.24     0.21



The probabilities of the classes are calculated from this
and from the membership functions of the words:

Prob (Video) = 0.197814*0.24 + 0.221600*0.8 + 0.216536*0.43
Prob (X-Windows) = 0.178526*0.21 + 0.299126*0.19 + 0.269369*0.33

Prob (Video) = 0.3
Prob (X-Windows) = 0.18
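The arithmetic of the application example can be checked with a few lines of Python. The membership values and word probabilities are the figures of the example; the helper name `prob` and the list layout are illustrative only.

```python
# (membership value from the relevance lexicon, word probability) per word
video = [
    (0.197814, 0.24),  # x
    (0.221600, 0.8),   # mirage
    (0.216536, 0.43),  # bildschirm
]
x_windows = [
    (0.178526, 0.21),  # starten
    (0.299126, 0.19),  # x
    (0.269369, 0.33),  # bildschirm
]


def prob(fuzzy_set):
    """Probability of a fuzzy set: sum of membership * word probability."""
    return sum(mu * p for mu, p in fuzzy_set)


prob_video = prob(video)
prob_x_windows = prob(x_windows)
best = "Video" if prob_video > prob_x_windows else "X-Windows"
```

The sums come to about 0.318 for Video and about 0.183 for X-Windows, which agree with the stated results of 0.3 and 0.18, so the text of the example is allocated to the class Video.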

Although other modifications and changes may be suggested
by those skilled in the art, it is the intention of the
inventors to embody within the patent warranted hereon all
changes and modifications as reasonably and properly come
within the scope of their contribution to the art.

What is claimed is:

1. A method for the automatic classification of a text
applied to a document, after the text has been transformed
into digital data with the aid of a computer, comprising the
steps of:

morphologically analyzing the text to be classified in a
morphological analyzer preceding the contents
analysis,

supplying the morphologically analyzed text to a stochastic
tagger in order to resolve lexical ambiguities,

using the tagged text for text classification,

defining each text class by significant words,

storing the significant words and their significance to the
text class in a lexicon file for each text class,

comparing a text to be allocated with all text classes and,
for each text class, determining a fuzzy set of words in
text and text class and its significance to the text class,

determining probability of the allocation of the text to the
text class from the fuzzy set of each text class and its
significance to each text class, and

selecting the text class with the highest probability and
allocating the text to this class.
2. A method according to claim 1,

generating a relevance lexicon for the classification of the
text,

for this purpose, using a set of training texts, the classes
of which are known,

counting the frequencies of the classes, of words and of
words in the respective classes from said set,

calculating an empirical correlation between a word and
a class by said frequencies,

calculating said correlation for all words and all classes
and storing a result of the calculation in a file as
relevance of a word to a class, which file is used as
relevance file or relevance lexicon.
3. A method according to claim 2, in which the correlation
between a word and a class is established in accordance with
the following formula:

$$\mathrm{rlv}(w \text{ in } c) := r(w, c) = \frac{N \cdot \Sigma_{wc} - \Sigma_w \cdot \Sigma_c}{\sqrt{\left(N \cdot \Sigma_w - \Sigma_w^2\right)\left(N \cdot \Sigma_c - \Sigma_c^2\right)}}$$

where:

N = number of training texts,
Σ_wc = number of training texts of class c with word w,
Σ_w = number of training texts with word w,
Σ_c = number of training texts of class c.

4. A method according to claim 3 in which only
correlations greater than a selected value r-max are taken into
consideration, which value is established at a significance
level of 0.001.

5. A method according to claim 4, further comprising the
steps of:

using the text to be examined and relevance lexicon for
determining for each class the fuzzy set of significant
words per class and its relevance per class,

calculating from the fuzzy set per class and its relevance
to each class the probability of its fuzzy set of relevant
words,

determining the class with the maximum probability from
the probabilities per class and allocating the text to this
class.

6. A method according to claim 5 in which the probability
is calculated in accordance with the formula

$$\mathrm{prob}(A) := \sum_x \mu_A(x) \cdot p(x)$$

where μ_A is the membership function which specifies the
extent to which the fuzzy set is allocated to a class, and
which just corresponds to the correlation measure according
to the above formula.

7. A method according to claim 1, comprising:

automatic diagnosis from medical findings,

considering medical findings to be the text and considering
an illness to be a class,

automatically learning in a training phase, the knowledge
required for the classification from a set of findings the
diagnosis of which is known,

and classifying a new finding in accordance with the
technique of fuzzy sets.



